Web Scraping with Puppeteer

My notes on creating a scraper with Puppeteer

Why create a web scraper?

As developers, we reach a point where we want to automate every single task we do online. We can create a scraper to get data from lots of different sources, create files from it, visit links and search for a pattern, download files from URLs…

Today I had a case where I had to visit several links and, for each one of them, click a button to download a PDF file. There were some gotchas I think are worth sharing as a blog post.

Why Puppeteer?

There are a lot of ways to do web scraping out there. The first decision you have to make is which programming language you'll use. I initially planned to create the scraper using Golang's Colly. I had already used it for getting data from news websites and outputting the results to the terminal. It was part of a project I was working on 🙂.

But this time, I had to interact with the DOM by clicking a button, and unfortunately Colly can't do that programmatically. So I had to move on and ended up with the famous NodeJS Puppeteer. It is easy to get started with and makes it easy to handle events in the browser.

Getting started

I decided to keep the whole scraper in a single file, since it's not too complex and it's easier to manage. Start by creating a NodeJS project. Create your folder and run the necessary commands:

mkdir scraper
cd ./scraper
npm init -y

Now install puppeteer as a dependency:

npm install --save puppeteer

Nice! Now we can start coding our scraper 🤘.

Basic Puppeteer commands

After requiring the puppeteer lib at the top of the file, you can run the launch command, passing some options as an object. In this case, I just wanted to make sure Puppeteer ran headless. This means it won't open a visible browser window; it will run detached from any user interface.

const puppeteer = require("puppeteer");

async function scraper() {
  // Launch a headless browser instance
  const browser = await puppeteer.launch({ headless: true });

  // ...the scraping steps go here...

  // Close the browser when we're done, so the process can exit
  await browser.close();
}

scraper();

Now, I won't put the whole code in this blog post, but I will highlight the main commands I used to create my scraper.

The first one is to create an instance of a new browser tab (a page). We can do this by writing:

const page = await browser.newPage();

We will use this page instance through the whole script.

To navigate somewhere, we can simply do:

await page.goto(SOME_URL, { waitUntil: "networkidle2" });

Note the object passed as the second param. We can specify some options there, such as waiting until the network is idle before executing the next command.

As I said, the main benefit of using Puppeteer was the possibility of interacting with the DOM. We can use a lot of built-in methods to select DOM elements, wait for them to appear in the document, interact with them, and so on.

To wait for some element to appear before executing the next command:

await page.waitForSelector("input[id=name]");

To select an element and do something with it, we can use $eval. For example, selecting an input and filling it with data:

await page.$eval("input[id=name]", (el) => (el.value = "John"));

To click an element:

await page.click('input[type="submit"]');

Now imagine we need to visit some specific links on the website. We can get all of them by doing:

const allLinks = await page.$$eval("a.link-class", (links) =>
  links.map((a) => a.href)
);

Note the use of $$eval above. It's like querySelectorAll if we translate it to vanilla JavaScript. The value stored in the allLinks variable is an array with every href, so we can iterate through them and do whatever we want 😛.
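To give an idea of what that iteration could look like, here is a minimal sketch that just visits each link in sequence, reusing the same page instance (the body of the loop is whatever your scraper needs to do on each page):

for (const link of allLinks) {
  // Navigate to each collected link, one at a time
  await page.goto(link, { waitUntil: "networkidle2" });
  // ...interact with the page here (fill inputs, click buttons, etc.)
}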

A gotcha I struggled with a little bit

I had to click an element inside an iframe! And by clicking this button I had to download a PDF and store it on my own computer, in some path I specify. So I had to make some adjustments to get it working.

First, since we are talking about an iframe, we have to remember the iframe has its own #document tree. So let's get it, in order to reach the button we want to click:

const elementIframe = await page.$("#iframecontent");
const iframe = await elementIframe.contentFrame();

Note that I first got the iframe element by its id. Then I reached the whole iframe content using the built-in method contentFrame, which returns the iframe's document tree. Now I can reference iframe to get my button and click it:

await iframe.click("#download");
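A small extra precaution here: the iframe's content might not be ready by the time we try to click. Frame instances expose waitForSelector just like pages do, so a sketch like this can make the click more reliable (the selectors are the same ones from the example above):

// Wait for the download button to exist inside the iframe before clicking it
await iframe.waitForSelector("#download");
await iframe.click("#download");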

Is it done? Not quite. Remember I told you I wanted to download a PDF at each button click, into some specified path on my own computer? Well, if we keep the code as it is, Puppeteer will handle the downloads inside the headless browser, and the moment we shut Puppeteer down, all our downloaded files will go away. So we need this next piece of code before clicking the button that downloads the PDF:

await page._client.send("Page.setDownloadBehavior", {
  behavior: "allow",
  downloadPath: "./myDownloads",
});
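A quick caveat about _client: it is a private property of the page, so it may change or disappear between Puppeteer versions. If your version lets you open a CDP session explicitly, the same command can be sent through it. This is just a sketch, assuming page.target().createCDPSession() is available in the Puppeteer version you're using:

// Open an explicit CDP session instead of relying on the private _client
const client = await page.target().createCDPSession();

await client.send("Page.setDownloadBehavior", {
  behavior: "allow",
  downloadPath: "./myDownloads",
});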

The downloadPath is relative to where you run the script, so in my case the files end up inside the scraper folder, together with our package.json and index.js files. Remember the beginning of this post, when we created our scraper folder?
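One extra precaution I like to take (my own assumption here, since I haven't checked this against every Chromium version): make sure the download folder actually exists before the first click, using Node's fs module:

const fs = require("fs");

// Create the download folder up front if it doesn't exist yet
fs.mkdirSync("./myDownloads", { recursive: true });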

So now we are done! Puppeteer will download the files and we can safely access them after Puppeteer shuts down.

I hope this post sheds some light on web scraping with Puppeteer. It's just the beginning of the possibilities; you can always consult the docs or look for other amazing articles around the web!

See you around 👋