Create an Automated Remote Jobs Scraping Bot with Node.js and Puppeteer (plus Express and Crontab)



What is Web Scraping?

Web scraping is a technique used to extract data from websites into spreadsheets or databases on a server, either for data analytics or for building bots for different purposes.

What Are We Going to Create?

We will create a remote jobs scraping bot that runs automatically every day, scrapes the site, and serves the data through an Express server, so we can open the site and see the newly scraped job offerings.

The site we are going to scrape is remoteok.io.

Note: Make sure to get permission before scraping a website.

Install Dependencies

We will use Puppeteer, a headless browser API that provides a Chromium instance we can control in the background, pretty much like a regular browser.

To automate the scraping, we have to run the script every day (or on whatever interval you need). This can be done with crontab, a time-based job scheduler utility on Linux, which is also available in Node.js through the cron package.

Finally, we will display the scraped jobs through an Express server, and I will be using express-generator to scaffold an Express project with the Pug template engine.
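
As a quick setup sketch (the project name remote-jobs-bot here is just a placeholder), the whole stack can be installed like this:

npx express-generator --view=pug remote-jobs-bot
cd remote-jobs-bot
npm install
npm install puppeteer cron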

Inspecting the Target Site (Remoteok.io)

The first step before scraping any website is to inspect the site's content so you know how to build your scripts, because scraping depends on knowing how the site's DOM is structured and which HTML elements and attributes you will need to access.

For the inspection process, you can use Chrome DevTools or Firefox's Developer Tools.

In this tutorial, we will scrape and fetch today's jobs only, so on remoteok.io we need to inspect the today's jobs section to see how everything is put together.

Make sure to identify the wrapping container of the elements you want to scrape so you can access the nested children and scrape them easily.

NOTE: Scraping scripts can easily go out of date because the target website's content is constantly changing, which makes them a bit hard to keep updated.

Create the Scraping Script

Make sure to take a look at the Puppeteer Docs to understand how it works.

Let’s first launch a browser with Puppeteer and navigate to the remoteok.io page.

We will save all the jobs in an in-memory array (you could also create a database and store the jobs there).

const puppeteer = require("puppeteer");

let jobs = [];

module.exports.run = async () => {
  //Launch a headless Chromium instance
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://remoteok.io");


  await browser.close();
};

As you can see, we are using async/await to handle asynchronous promises as if they were synchronous code, because all methods on Puppeteer are promise-based.

We are also exporting the main function from the module as run so we can call it from outside, e.g. from our server.

Now, we need to look for the today's jobs body and get all the jobs (title, company, and technologies).

Today’s jobs are wrapped in a tbody (table) container, and each job lives in a tr row.

For the title and the company name, the elements carry the attributes [itemprop=title] and [itemprop=hiringOrganization] respectively, so we can easily access them through attribute selectors.
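
Put together, the markup we are targeting looks roughly like this (a simplified sketch inferred from the selectors used below, with made-up sample values; the real page differs and changes over time):

<tbody>
  <tr>  <!-- one row per job -->
    <td>
      <h2 itemprop="title">Senior Backend Developer</h2>
      <h3 itemprop="hiringOrganization">Acme Inc</h3>
    </td>
    <td class="tags">  <!-- one .tag link per technology -->
      <a class="tag"><h3>javascript</h3></a>
      <a class="tag"><h3>node</h3></a>
    </td>
  </tr>
</tbody>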

async function loadLatestJobs(page) {
  //Clear previous jobs
  jobs = [];
  //Today's jobs container
  const todaysJobsBody = await page.$("tbody");
  //All rows of the container (row = job)
  const bodyRows = await todaysJobsBody.$$("tr");

  //Loop through all rows and extract each job's data
  const rowsMapping = bodyRows.map(async row => {
    //Get title element
    const jobTitleElement = await row.$("[itemprop=title]");
    if (jobTitleElement) {
      const titleValue = await getPropertyValue(jobTitleElement, "innerText");
      //Get company element
      const hiringOrganization = await row.$("[itemprop=hiringOrganization]");
      let organizationName = "";
      if (hiringOrganization) {
        organizationName = await getPropertyValue(
          hiringOrganization,
          "innerText"
        );
      }
    }
  });
  //Make sure to wait for all row promises to complete before moving on
  //Otherwise we will get an error for closing the browser window before scraping the data
  await Promise.all(rowsMapping);
}
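
Note that the function above calls a getPropertyValue helper that is never defined in this article. A minimal sketch of it, built on Puppeteer's elementHandle.getProperty and jsonValue (the name comes from the calls above; the exact body is my assumption):

//Read a DOM property (e.g. innerText) from a Puppeteer element handle
//NOTE: assumed implementation; not shown in the original article
async function getPropertyValue(element, property) {
  const handle = await element.getProperty(property);
  return handle.jsonValue();
}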

When using map with an async callback, it returns an array of promises, so we have to await Promise.all (wait for all the promises to resolve) before we can continue.

Now we need to get all the technologies for a specific job. Each technology lives in a hyperlink (a) element, and all of the tags are wrapped by a .tags container.

async function loadLatestJobs(page) {
  //Clear previous jobs
  jobs = [];
  //Today's jobs container
  const todaysJobsBody = await page.$("tbody");
  //All rows of the container (row = job)
  const bodyRows = await todaysJobsBody.$$("tr");

  //Loop through all rows and extract each job's data
  const rowsMapping = bodyRows.map(async row => {
    //Get title element
    const jobTitleElement = await row.$("[itemprop=title]");
    if (jobTitleElement) {
      const titleValue = await getPropertyValue(jobTitleElement, "innerText");
      //Get company element
      const hiringOrganization = await row.$("[itemprop=hiringOrganization]");
      let organizationName = "";
      if (hiringOrganization) {
        organizationName = await getPropertyValue(
          hiringOrganization,
          "innerText"
        );
      }
      //Technology elements (multiple tags for a single job)
      let technologies = [];
      const tags = await row.$$(".tag");
      technologies = await Promise.all(
        tags.map(async tag => {
          const tagContent = await tag.$("h3");
          return (
            await getPropertyValue(tagContent, "innerText")
          ).toLowerCase();
        })
      );
      //Remove all duplicates
      technologies = [...new Set(technologies)];
      //Add the new job
      addJob(titleValue, organizationName, ...technologies);
    }
  });
  //Make sure to wait for all row promises to complete before moving on
  //Otherwise we will get an error for closing the browser window before scraping the data
  await Promise.all(rowsMapping);
}
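
One wiring detail to watch: the run function from earlier never calls loadLatestJobs, so nothing would actually be scraped. The final version should call it before closing the browser, along these lines:

module.exports.run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://remoteok.io");
  //Scrape today's jobs while the page is still open
  await loadLatestJobs(page);
  await browser.close();
};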

Also, make sure to add a helper function for adding a new job to the jobs array with title, company and technologies.

function addJob(title, company, ...technologies) {
  if (jobs) {
    const job = { title, company, technologies };
    jobs.push(job);
  }
}
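
The scheduler and the Express route below also read the jobs through a getJobs export, which isn't shown in the original; a one-liner like this does the job:

//Expose the scraped jobs to other modules (used by the scheduler and the server)
module.exports.getJobs = () => jobs;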

Schedule the Script to Run Every Day

The cron package allows you to easily schedule scripts to run on a specific time interval, using the same expression syntax as crontab.

const { CronJob } = require("cron");

const remoteJobsScraper = require("./remotejobs-scraper");

console.log("Scheduler Started");
//"* * * * *" fires every minute, which is handy for testing;
//switch to "0 0 * * *" to run once a day at midnight
const fetchRemoteJobsJob = new CronJob("* * * * *", async () => {
  console.log("Fetching new Remote Jobs...");
  await remoteJobsScraper.run();
  console.log("Jobs: ", remoteJobsScraper.getJobs());
});
//You need to explicitly start the cron job
fetchRemoteJobsJob.start();

The onTick callback (the second argument to CronJob) is the main script function, which gets called every time the scheduled job runs.

A cron job must be explicitly started, which gives you a little more control over your jobs.
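
For reference, a cron expression has five fields: minute, hour, day of month, month, and day of week (the cron package also accepts an optional leading seconds field). A few example patterns:

"* * * * *"    //every minute
"0 0 * * *"    //every day at midnight
"0 9 * * 1-5"  //at 09:00, Monday through Friday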

Run Server and Display Jobs

The server simply imports the scheduler, which in turn runs the scraping bot on the specified interval.

So go into app.js and add a new GET route on the server at /jobs.

const remoteJobsScraper = require("../remotejobs-scraper");

app.get("/jobs", (req, res, next) => {
  //Get all fetched jobs and pass them to the index template for rendering
  res.render("index", {
    jobs: remoteJobsScraper.getJobs()
  });
});

Also, make sure to import the scheduler module in order to start the cron job once the server starts running.

The cron job will be automatically disposed of once the server is shut down.

//Start Scheduler
require("../scheduler");

For displaying the jobs, we will use the Pug template engine.

extends layout

block content
  h1 Here is the List of Today's Remote Jobs
  ul
    for job in jobs
      li
        span Title: #{job.title} 
        span Company: #{job.company} 
        span Technologies: 
        for tech in job.technologies
          span #{tech} 

Now, if you run the server on localhost:3000 and go to /jobs, you should see today’s jobs scraped from remoteok.io.
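
Since the project was scaffolded with express-generator, it already comes with a start script, so all that's left is:

npm start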
