NodeJS Web Scraping with Cheerio


By gobrain

Jun 21st, 2024

Web scraping is a technique for extracting data from websites by parsing the HTML or XML structure of web pages. Its uses range from transforming page content into a structured format, such as a spreadsheet or a database, to automating repetitive tasks.

If you want to discover how to scrape web pages in NodeJS using Cheerio, this step-by-step guide is for you.

What is Cheerio?

Cheerio is a popular library for working with HTML and XML in Node.js environments. Unlike a full web browser, Cheerio does not render pages, apply CSS, or execute JavaScript; it only parses the markup and gives you an API for working with the result. This makes it incredibly fast and efficient for tasks like data extraction.

Cheerio adopts a subset of the syntax from jQuery, a widely used library for manipulating the Document Object Model (DOM) in web browsers. This familiarity makes it easy to learn and use, especially for those already comfortable with jQuery.

Cheerio is popular among developers: at the time of writing, it has almost 8 million downloads on npm and 28 thousand stars on GitHub. This is not a surprise considering its powerful API. Now, let's set up a Node project and learn how to scrape web pages in Node.js with Cheerio.

Setting Up the Project

To get started with web scraping using Cheerio, you need to set up a Node project. Follow these steps to create one:

  • Create a new directory for your project and navigate to it using the command line.
  • Initialize a new Node project by running the following command: npm init -y. This will create a package.json file for your project.
  • Install Cheerio as a dependency by executing the following command: npm install cheerio.

Now that your project is set up, you can start scraping web pages.

How To Scrape Web Pages In NodeJs with Cheerio

Once you import the cheerio package into your project, you can start scraping web pages by loading, parsing and manipulating them.

How To Load A Source

Before you can work with an HTML source, you need to load it. Cheerio provides several ways to do this, and the most commonly used is the load method.

The load method takes a string containing the HTML of the page and returns a function, conventionally named $, that lets you select, traverse, and manipulate the parsed HTML structure.

Here is an example:

const cheerio = require("cheerio");
    
 // Example HTML content
 const html = `
   <html>
     <head>
       <title>Sample Page</title>
     </head>
     <body>
       <div id="container">
         <h1>Hello, Cheerio!</h1>
         <p>Welcome to web scraping with Cheerio.</p>
       </div>
     </body>
   </html>`;
    
 // Load the HTML content with Cheerio
 const $ = cheerio.load(html);
    
 // Get the text of the <h1> tag
 const headingText = $("h1").text();
 console.log(headingText); // Output: Hello, Cheerio!
    
 //  Get the text of the <p> tag
 const paragraphText = $("p").text();
 console.log(paragraphText); // Output: Welcome to web scraping with Cheerio.

Of course, real pages are not typed into your code by hand as strings, which takes us to the next step:

How to Fetch HTML with Axios

To pass an HTML document string to the load function, you first need to fetch the target page with an HTTP GET request. To make the request, we will use the Axios package, which makes HTTP requests straightforward. Install it with npm install axios.

 const cheerio = require("cheerio");
 const axios = require("axios");
    
 async function scrape() {
   const url = "https://www.apple.com/"; // Replace with your desired URL
    
   try {
     // Step 1: Fetch the web page content
     const response = await axios.get(url);
     const html = response.data;
    
     // Step 2: Load the HTML content with Cheerio
     const $ = cheerio.load(html);
    
     // Step 3: Extract data from the web page
     // ... perform your desired operations using Cheerio methods
    
     // Example: Get the text of the <h1> tag
     const headingText = $("h1").text();
     console.log(headingText);
     // Apple
   } catch (error) {
     // Include the underlying message so failures are easier to diagnose
     console.error(`Error fetching the URL: ${url}`, error.message);
   }
 }
    
 scrape();

In this example, we sent a GET request to Apple’s website and passed the HTML document that Axios returned to the load method. Then we simply read the text of the page’s h1 element, which was “Apple” at the time of writing.

Select Elements

Once you have loaded the source, it is time to select elements for manipulation, automation, or any other processing you want to perform. Cheerio lets you select elements with CSS selectors, including tag names, class names, IDs, attribute selectors, and more.

 const cheerio = require("cheerio");
    
 const html = `
   <html>
     <head>
       <title>Sample Page</title>
     </head>
     <body>
       <div class="box" id="container">
         <h1>Hello, Cheerio!</h1>
         <p>Welcome to web scraping with Cheerio.</p>
         <ul>
           <li>Item 1</li>
           <li>Item 2</li>
           <li>Item 3</li>
         </ul>
       </div>
     </body>
   </html>
 `;
    
 const $ = cheerio.load(html);
    
 // Selecting elements by tag name
 const headings = $("h1");
 console.log(headings.text());
 // Output: Hello, Cheerio!
    
 // Selecting elements by class name
 const paragraphs = $(".box p");
 console.log(paragraphs.text());
 // Output: Welcome to web scraping with Cheerio.
    
 // Selecting elements by ID
 const container = $("#container");
 console.log(container.text());
 // Output (extra whitespace trimmed):
 // Hello, Cheerio! Welcome to web scraping with Cheerio.
 // Item 1
 // Item 2
 // Item 3
    
 // Selecting multiple elements
 const listItems = $("ul li");
 listItems.each((index, element) => {
   console.log($(element).text());
 });
 // Output:
 // Item 1
 // Item 2
 // Item 3

Traversing DOM

Cheerio can also traverse the DOM, that is, navigate the hierarchical structure of an HTML or XML document to locate specific elements. It provides a range of methods for moving between elements and filtering them.

These are:

Parent and Children

You can navigate to the parent or children of a selected element using the parent() and children() methods, respectively.

For example:

const cheerio = require("cheerio");

const html = `......`;

const $ = cheerio.load(html);

const heading = $("h1");
const parentElement = heading.parent();
console.log(parentElement.text());

/*
 Hello, Cheerio!
 Welcome to web scraping with Cheerio.

    Item 1
    Item 2
    Item 3      
*/

const list = $("ul");
const childElements = list.children();
console.log(childElements.text());
// Item 1Item 2Item 3

Siblings

Cheerio provides methods like siblings(), next(), and prev() to access the siblings of an element.

For instance:

const cheerio = require("cheerio");

// html is the same sample document used in the previous examples
const $ = cheerio.load(html);

const firstItem = $("li").first();
const nextItem = firstItem.next();
const previousItem = firstItem.prev();
const siblingItems = firstItem.siblings();

console.log(firstItem.text());
// Item 1

console.log(nextItem.text());
// Item 2

console.log(previousItem.text());
// ""

console.log(siblingItems.text());
// Item 2Item 3

Traversal by Selector

You can traverse the DOM using selectors to find specific elements within the current selection. Cheerio offers methods like find(), filter(), and closest() for this purpose.

For example:

const cheerio = require("cheerio");

const html = `
  <html>
    <head>
      <title>Sample Page</title>
    </head>
    <body>
      <div class="box" id="container">
        <h1>Hello, Cheerio!</h1>
        <p>Welcome to web scraping with Cheerio.</p>
        <p class="special">Special Text.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
          <li>Item 3</li>
        </ul>
      </div>
    </body>
  </html>
`;

const $ = cheerio.load(html);

const container = $("#container");
const innerHeading = container.find("h1");
console.log(innerHeading.text());
// Hello, Cheerio!

const filteredElements = container.find("p").filter(".special");
console.log(filteredElements.text());
// Special Text.

const closestDiv = innerHeading.closest("div").prop("class");
console.log(closestDiv);
// box

Each and Map

Cheerio allows you to iterate over a selection of elements using the each() method. You can also transform a selection with the map() method, which returns a new Cheerio object; calling get() on the result gives you a plain array.

Here’s how you can use them:

$("li").each((index, element) => {
  // Process each li element here
});

const texts = $("p")
  .map((index, element) => $(element).text())
  .get();

console.log(texts);
// [ 'Welcome to web scraping with Cheerio.', 'Special Text.' ]

Manipulating DOM

After loading the HTML source and selecting the desired elements, you can use the methods provided by Cheerio to manipulate and modify these elements. This includes modifying element attributes and properties, adding or removing classes, altering text and HTML content, and inserting or removing elements.

Modifying Text Content

You can change the text content of an element using the text() method.

For example:

const $ = cheerio.load(html);

const heading = $("h1");
heading.text("New Heading");

console.log($("h1").text());
// New Heading

Modifying HTML Content

Cheerio provides the html() method to modify the HTML content of an element.

Here’s an example:

const $ = cheerio.load(html);

const container = $("#container");
container.html("<h2>New Content</h2>");
console.log(container.html());
// <h2>New Content</h2>

Modifying Attributes and Properties

You can add or update attributes on elements using the attr() method, and read values back with attr() or prop().

For example:

const $ = cheerio.load(html);

const h1 = $("h1");
h1.attr("class", "heading");
console.log(h1.prop("class"));
// heading

Modifying Classes

To add, remove, or toggle classes on an element, you can use the addClass(), removeClass(), and toggleClass() methods.

const $ = cheerio.load(html);

const h1 = $("h1");
h1.addClass("my-new-class my-other-class");
console.log(h1.prop("class"));
// my-new-class my-other-class
h1.removeClass("my-other-class");
console.log(h1.prop("class"));
// my-new-class

/* To toggle a class on an element, 
you can add the class if it is not present, and remove 
it if it already exists. */

h1.toggleClass("active");
console.log(h1.prop("class"));
// my-new-class active

Adding, Removing And Replacing Elements

You can add new elements or remove existing elements using methods like append(), prepend(), after(), before(), remove(), and replaceWith().

For example:

// Add an element to the end of a parent element
$('ul').append('<li>Item</li>');

// Add an element to the beginning of a parent element
$('ul').prepend('<li>Item</li>');

// Insert an element before a target element
$('li').before('<li>Item</li>');

// Insert an element after a target element
$('li').after('<li>Item</li>');

// Replace an element
$('li').replaceWith('<li>New Element</li>');

// Remove an element
$('li').find('h1').remove();

Conclusion

Web scraping is an important technique with a wide range of uses, from data extraction to automation. Node.js has various libraries for these tasks; in this article, we covered the cheerio package for scraping web content in Node.js, along with the Axios package for fetching pages.

Thank you for reading.