Web Scraping with Cheerio


By gobrain

Apr 15th, 2024

If you want to learn how to scrape web pages in Node.js using Cheerio, this step-by-step guide is for you.

What is Web Scraping?

Web scraping is a technique for extracting data from websites by parsing the HTML or XML structure of their pages. Its uses range from converting page content into a structured format, such as a spreadsheet or a database, to automating repetitive tasks.

What is Cheerio?

Cheerio is a fast and flexible implementation of the jQuery core designed specifically for server-side scraping of web pages. It provides an API for traversing and manipulating a parsed HTML or XML document using familiar jQuery syntax in the Node.js environment.

Setting Up the Project

To get started with web scraping using Cheerio, you need to set up a Node project. So, follow these steps to create a new project:

  1. Create a new directory for your project and navigate to it using the command line.
  2. Initialize a new Node project by running the following command: npm init -y. This will create a package.json file for your project.
  3. Install Cheerio as a dependency by executing the following command: npm install cheerio.

Now that your project is set up, you can start writing code to scrape web pages.

Scraping Web Pages with Cheerio in Node.js

Once you import the cheerio package into your project, you can start scraping web pages by loading, parsing and manipulating them.

Load A Source

Before you can work with an HTML source, you need to load it. Cheerio offers several loading functions, and the most commonly used one is load.

The load method parses HTML content and returns a Cheerio instance, conventionally named $, that lets you traverse and manipulate the parsed HTML structure. It accepts a string argument containing the markup of the web page.

Here is an example:


    const cheerio = require("cheerio");
    
    // Example HTML content
    const html = `
      <html>
        <head>
          <title>Sample Page</title>
        </head>
        <body>
          <div id="container">
            <h1>Hello, Cheerio!</h1>
            <p>Welcome to web scraping with Cheerio.</p>
          </div>
        </body>
      </html>
    `;
    
    // Load the HTML content with Cheerio
    const $ = cheerio.load(html);
    
    // Get the text of the <h1> tag
    const headingText = $("h1").text();
    console.log(headingText); // Output: Hello, Cheerio!
    
    //  Get the text of the <p> tag
    const paragraphText = $("p").text();
    console.log(paragraphText); // Output: Welcome to web scraping with Cheerio.

Of course, in practice you rarely have a page's HTML available as a hard-coded string; you need to fetch it first. This takes us to the next step:

Cheerio With Axios

To pass an HTML document string to the load function, you first need to fetch the target page with an HTTP GET request. To make the request, we will use the Axios package (installed with npm install axios), which makes HTTP requests straightforward.

const cheerio = require("cheerio");
const axios = require("axios");

async function scrape() {
  const url = "https://www.apple.com/"; // Replace with your desired URL

  try {
    // Step 1: Fetch the web page content
    const response = await axios.get(url);
    const html = response.data;

    // Step 2: Load the HTML content with Cheerio
    const $ = cheerio.load(html);

    // Step 3: Extract data from the web page
    // ... perform your desired operations using Cheerio methods

    // Example: Get the text of the <h1> tag
    const headingText = $("h1").text();
    console.log(headingText);
    // Apple
  } catch (error) {
    console.error(`Error fetching the URL: ${url}`, error.message);
  }
}

scrape();

In this example, we sent a GET request to Apple’s website and passed the HTML document that Axios returned to the load method. Then we simply read the H1 text of the page, which at the time of writing is “Apple”.

Select Elements

Once you have loaded the source, it is time to select elements for manipulation, automation, or any other processing you want to perform. Cheerio lets you select elements with CSS selectors. These include tag names, class names, IDs, attribute selectors, and more.

const cheerio = require("cheerio");

const html = `
  <html>
    <head>
      <title>Sample Page</title>
    </head>
    <body>
      <div class="box" id="container">
        <h1>Hello, Cheerio!</h1>
        <p>Welcome to web scraping with Cheerio.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
          <li>Item 3</li>
        </ul>
      </div>
    </body>
  </html>
`;

const $ = cheerio.load(html);

// Selecting elements by tag name
const headings = $("h1");
console.log(headings.text());
// Output: Hello, Cheerio!

// Selecting elements by class name
const paragraphs = $(".box p");
console.log(paragraphs.text());
// Output: Welcome to web scraping with Cheerio.

// Selecting elements by ID
const container = $("#container");
console.log(container.text());
// Output (the line breaks and indentation of the source HTML are preserved):
// Hello, Cheerio!
// Welcome to web scraping with Cheerio.
// Item 1
// Item 2
// Item 3

// Selecting multiple elements
const listItems = $("ul li");
listItems.each((index, element) => {
  console.log($(element).text());
});
// Output:
// Item 1
// Item 2
// Item 3

Traversing DOM

Cheerio can also traverse the DOM, that is, navigate the hierarchical structure of an HTML or XML document to locate specific elements. It provides a range of methods for moving between elements and filtering them.

These are:

Parent and Children

You can navigate to the parent or children of a selected element using the parent() and children() methods, respectively.

For example:

const cheerio = require("cheerio");

const html = `......`;

const $ = cheerio.load(html);

const heading = $("h1");
const parentElement = heading.parent();
console.log(parentElement.text());

/*
 Hello, Cheerio!
 Welcome to web scraping with Cheerio.

    Item 1
    Item 2
    Item 3      
*/

const list = $("ul");
const childElements = list.children();
console.log(childElements.text());
// Item 1Item 2Item 3

Siblings

Cheerio provides methods like siblings(), next(), and prev() to access the siblings of an element.

For instance:

const cheerio = require("cheerio");

// html holds the same sample document as in the earlier examples
const $ = cheerio.load(html);

const firstItem = $("li").first();
const nextItem = firstItem.next();
const previousItem = firstItem.prev();
const siblingItems = firstItem.siblings();

console.log(firstItem.text());
// Item 1

console.log(nextItem.text());
// Item 2

console.log(previousItem.text());
// ""

console.log(siblingItems.text());
// Item 2Item 3

Traversal by Selector

You can traverse the DOM using selectors to find specific elements within the current selection. Cheerio offers methods like find(), filter(), and closest() for this purpose.

For example,

const cheerio = require("cheerio");

const html = `
  <html>
    <head>
      <title>Sample Page</title>
    </head>
    <body>
      <div class="box" id="container">
        <h1>Hello, Cheerio!</h1>
        <p>Welcome to web scraping with Cheerio.</p>
        <p class="special">Special Text.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
          <li>Item 3</li>
        </ul>
      </div>
    </body>
  </html>
`;

const $ = cheerio.load(html);

const container = $("#container");
const innerHeading = container.find("h1");
console.log(innerHeading.text());
// Hello, Cheerio!

const filteredElements = container.find("p").filter(".special");
console.log(filteredElements.text());
// Special Text.

const closestDiv = innerHeading.closest("div").prop("class");
console.log(closestDiv);
// box

Each and Map

Cheerio allows you to iterate over a selection of elements using the each() method. You can also transform a selection into a plain array by calling map() followed by get().

Here’s how you can use them:

$("li").each((index, element) => {
  // Process each li element here
});

const texts = $("p")
  .map((index, element) => $(element).text())
  .get();

console.log(texts);
// [ 'Welcome to web scraping with Cheerio.', 'Special Text.' ]

Manipulating DOM

After loading the HTML source and selecting the desired elements, you can use Cheerio's methods to modify them. This includes changing element attributes and properties, adding or removing classes, altering text and HTML content, and inserting or removing elements.

Modifying Text Content

You can change the text content of an element using the text() method.

For example:

const $ = cheerio.load(html);

const heading = $("h1");
heading.text("New Heading");

console.log($("h1").text());
// New Heading

Modifying HTML Content

Cheerio provides the html() method to modify the HTML content of an element.

Here’s an example:

const $ = cheerio.load(html);

const container = $("#container");
container.html("<h2>New Content</h2>");
console.log(container.html());
// <h2>New Content</h2>

Modifying Attributes and Properties

You can update or add attributes with the attr() method, and read or set properties with the prop() method.

For example:

const $ = cheerio.load(html);

const h1 = $("h1");
h1.attr("class", "heading");
console.log(h1.prop("class"));
// heading

Modifying Classes

To add, remove, or toggle classes on an element, you can use the addClass(), removeClass(), and toggleClass() methods.

const $ = cheerio.load(html);

const h1 = $("h1");
h1.addClass("my-new-class my-other-class");
console.log(h1.prop("class"));
// my-new-class my-other-class
h1.removeClass("my-other-class");
console.log(h1.prop("class"));
// my-new-class

/* toggleClass() adds the class if it is not present
and removes it if it already exists. */

h1.toggleClass("active");
console.log(h1.prop("class"));
// my-new-class active

Adding, Removing And Replacing Elements

You can add, remove, or replace elements using methods like append(), prepend(), after(), before(), remove(), and replaceWith().

For example,

// Add an element to the end of a parent element
$('ul').append('<li>Item</li>');

// Add an element to the beginning of a parent element
$('ul').prepend('<li>Item</li>');

// Insert an element before each matched element (here, every <li>)
$('li').before('<li>Item</li>');

// Insert an element after each matched element
$('li').after('<li>Item</li>');

// Replace each matched element
$('li').replaceWith('<li>New Element</li>');

// Remove an element (here, the last <li>)
$('li').last().remove();

Conclusion

Web scraping is a widely used technique for data extraction, automation, and similar tasks. Node.js has various libraries for it; in this article, we covered using the Cheerio package together with Axios to scrape web content in Node.js.

Thank you for reading.