How to Scrape Data from a Page with Infinite Scroll! ♾️

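Web scraping gets considerably harder when content is loaded dynamically. Modern websites often use infinite scroll or a "Load more" button to fetch additional content as the user moves down the page. This improves the user experience, but it also means that a single request for the page returns only the first batch of items.

In this tutorial, you will learn how to scrape data from a website that uses this loading mechanism. We will walk through fetching, parsing, and saving the data, and finish by exporting it to a CSV file.

By the end of this tutorial, you will know how to:

Fetch HTML content from a web page.
Simulate clicking the "Load more" button to load additional content.
Parse the HTML and extract specific data.
Save the extracted data to a CSV file.

Let's get started!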

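Prerequisites

To build this web scraper, we will use Node.js, a few open-source libraries, and the ZenRows API to handle anti-bot mechanisms. Here are the tools and libraries we will use:

Node.js: a runtime environment for executing JavaScript on the server side. You can download it from nodejs.org.
Axios: a library for making HTTP requests.
Cheerio: a library for parsing HTML.
csv-writer: a library for writing data to CSV files.
ZenRows API: a service for getting past anti-bot mechanisms.

First, create a new Node.js project named web-scraper-tool. Then run the following command in your terminal to install the required libraries: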



npm install axios cheerio csv-writer

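With the basic setup in place, you can start building the web scraper.

Step 1: Fetch the HTML Content

The first task is to fetch the page's HTML content. This means sending a request for the target URL through the ZenRows API. The response contains the page's raw HTML, which initially shows 12 products.

Create a fetchHtml function that uses the ZenRows API to fetch the HTML of a target URL. The function should make the HTTP request, handle errors, and return the HTML for further processing.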



const axios = require('axios');
const cheerio = require('cheerio');
const csv = require('csv-writer').createObjectCsvWriter;

const apiKey = 'ZENROWS_API_KEY'; // Replace with your ZenRows API key
const pageUrl = 'https://www.scrapingcourse.com/button-click'; // Page to be scraped

// Function to fetch HTML content from a URL
async function fetchHtml(url) {
   try {
       const response = await axios.get('https://api.zenrows.com/v1/', {
           params: { url, apikey: apiKey } // ZenRows expects the key in the lowercase "apikey" parameter
       });
       return response.data; // Return the HTML content
   } catch (error) {
       console.error(`Error fetching ${url}: ${error.message}`);
       return null; // Return null if an error occurs
   }
}
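
To make it clearer what the request above looks like on the wire, here is a minimal sketch that composes the same kind of URL by hand. The helper name buildRequestUrl is hypothetical, and the parameter names (url, apikey) are assumptions that should be checked against the ZenRows documentation. The point it illustrates is that the target page is passed as a query parameter value, so it has to be percent-encoded (Axios does this for you when you use params).

```javascript
// Sketch: compose a proxy-style request URL where the target page travels
// as a query parameter value. buildRequestUrl is a hypothetical helper.
function buildRequestUrl(endpoint, params) {
  const requestUrl = new URL(endpoint);
  for (const [name, value] of Object.entries(params)) {
    // searchParams.set() percent-encodes the value, so the "?" and "&" of
    // the target URL cannot be mistaken for the endpoint's own parameters.
    requestUrl.searchParams.set(name, String(value));
  }
  return requestUrl.toString();
}

const example = buildRequestUrl('https://api.zenrows.com/v1/', {
  url: 'https://www.scrapingcourse.com/button-click',
  apikey: 'ZENROWS_API_KEY', // placeholder value; parameter name is an assumption
});
console.log(example);
```

Printing the composed URL this way can help when debugging a failing request, because you can paste it into a browser or curl.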

To test the fetchHtml function, create a main function that runs the logic and prints the fetched HTML.



// Main function to test and execute all the logic
async function main() {
   const html = await fetchHtml(pageUrl); // Fetch HTML content of the initial page
   if (html) {
       console.log(html);
   }
}
main();

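Save the code as index.js and run it with the command `node index.js`. The terminal output should show the page's complete raw HTML. This HTML is the starting point for the data extraction.

Step 2: Load More Products

After fetching the initial product list, the next step is to simulate clicking the "Load more" button several times to load the remaining products. This ensures you collect the products beyond those shown when the page first loads.

Create a fetchAllProducts function that simulates clicking the "Load more" button by sending requests to the AJAX endpoint. The function should keep loading products until it has collected the specified number.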



const ajaxUrl = 'https://www.scrapingcourse.com/ajax/products'; // AJAX URL to load more products

// Function to fetch all products by simulating the "Load more" button
async function fetchAllProducts() {
  let productsHtml = [];
  let offset = 0;

  while (productsHtml.length < 48) {
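      const newHtml = await fetchHtml(`${ajaxUrl}?offset=${offset}`); // Pass the offset in the query string
      if (!newHtml) break; // Stop if no HTML is returned

      const $ = cheerio.load(newHtml);
      const products = $('div.product-item').map((_, element) => {
          return $(element).html();
      }).get();
      if (products.length === 0) break; // Stop when the endpoint has no more products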

      productsHtml.push(...products); // Collect the HTML content of the products
      offset += 12; // Increment offset to load the next set of products
      console.log(`Fetched ${productsHtml.length} products so far...`);
  }
  return productsHtml.join('\n'); // Join the HTML snippets into a single, cleaner string
}
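
The loop above advances an offset in steps of 12 until 48 products are collected. As a rough sketch of that arithmetic, the snippet below lists the offsets and URLs such a loop would request. The helper names pageOffsets and buildPageUrl are hypothetical, and the assumption that the endpoint reads an offset query parameter is taken from the tutorial's code rather than from any documentation.

```javascript
// Sketch: list the offsets a "Load more" loop would request.
// Assumption: each request returns `pageSize` items starting at `offset`.
function pageOffsets(total, pageSize) {
  if (!(pageSize > 0)) throw new Error('pageSize must be a positive number');
  const offsets = [];
  for (let offset = 0; offset < total; offset += pageSize) {
    offsets.push(offset);
  }
  return offsets;
}

// Sketch: attach the offset to the AJAX URL as a query parameter.
function buildPageUrl(baseUrl, offset) {
  const url = new URL(baseUrl);
  url.searchParams.set('offset', String(offset));
  return url.toString();
}

const pageUrls = pageOffsets(48, 12).map((offset) =>
  buildPageUrl('https://www.scrapingcourse.com/ajax/products', offset)
);
console.log(pageUrls); // four URLs, with offset=0, 12, 24 and 36
```

Separating the offset calculation from the fetching makes it easy to change the target count or page size without touching the request code.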

Update the main function to test the fetchAllProducts function.



// Main function to test and execute all the logic
async function main() {
   const productsHtml = await fetchAllProducts();
   console.log(productsHtml); // Log the fetched products
}

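After running the code, the terminal should display "Fetched X products so far..." messages, followed by the raw HTML of the products.

Step 3: Parse the Product Information

With the raw HTML of at least 48 products in hand, the next step is to parse it and extract specific product information: title, price, image URL, and product URL.

Create a parseProducts function that extracts these fields from the fetched HTML, using the Cheerio library to traverse and parse the content.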



// Function to parse product information from HTML
function parseProducts(html) {
   const $ = cheerio.load(html);
   return $('a[href*="/ecommerce/product/"]').map((_, item) => ({
       title: $(item).find('span.product-name').text().trim(),
       price: $(item).find('span.product-price').text().trim(),
       image: $(item).find('img').attr('src') || 'N/A',
       url: $(item).attr('href')
   })).get();
}

Update the main function so that it runs parseProducts and logs the result.



// Main function to test and execute all the logic
async function main() {
   const productsHtml = await fetchAllProducts();
   const products = parseProducts(productsHtml);
   console.log(products); // Log the parsed product information to the console
}

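After running the code, the terminal shows the parsed product information as an array of objects instead of the raw HTML from the previous step. Each object represents one product and contains its title, price, image URL, and product URL.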
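
Because the products are collected over several requests, the same item can end up in the array more than once if two responses overlap. The following is a minimal sketch of one way to guard against that, assuming the product URL is a reliable unique key. The helper name dedupeByUrl and the sample data are hypothetical.

```javascript
// Sketch: keep only the first occurrence of each product URL.
// Assumption: `url` uniquely identifies a product.
function dedupeByUrl(products) {
  const seen = new Set();
  return products.filter((product) => {
    if (seen.has(product.url)) return false; // already collected
    seen.add(product.url);
    return true;
  });
}

// Hypothetical sample data for illustration only.
const sample = [
  { title: 'Item A', url: '/ecommerce/product/item-a/' },
  { title: 'Item B', url: '/ecommerce/product/item-b/' },
  { title: 'Item A', url: '/ecommerce/product/item-a/' },
];
console.log(dedupeByUrl(sample).length); // 2
```

Calling this on the output of parseProducts before exporting keeps the CSV free of repeated rows.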

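Step 4: Export the Product Information to a CSV File

Once the data has been parsed, the next step is to save it in a structured format for later analysis. In this step, the parsed data is written to a CSV file. CSV is a common format for tabular data because it is simple and widely supported.

Create an exportProductsToCSV function that writes the parsed product data to a CSV file. Use the csv-writer library to define the file structure and save the data.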



// Function to export products to a CSV file
async function exportProductsToCSV(products) {
   const csvWriter = csv({
       path: 'products.csv',
       header: [
           { id: 'title', title: 'Title' },
           { id: 'price', title: 'Price' },
           { id: 'image', title: 'Image URL' },
           { id: 'url', title: 'Product URL' }
       ]
   });

   await csvWriter.writeRecords(products);
   console.log('CSV file has been created.');
}

Update the main function to run the exportProductsToCSV function.



// Main function to test and execute all the logic
async function main() {
   const productsHtml = await fetchAllProducts();
   const products = parseProducts(productsHtml);
   await exportProductsToCSV(products); // Export products to CSV
}

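After running the code, you should see a products.csv file in your working directory containing the parsed product information. The terminal also prints a message confirming that the CSV file has been created.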
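
Product titles and descriptions can contain commas, quotes, or line breaks, which is why a dedicated library is preferable to joining strings with commas. As an illustration only, here is a small sketch of the usual CSV quoting rule (wrap the field in double quotes and double any embedded quotes). The helper names escapeCsvField and toCsvLine are hypothetical and are not part of csv-writer, which is expected to handle this for you.

```javascript
// Sketch of the common CSV quoting rule, for illustration only.
function escapeCsvField(value) {
  const text = value === undefined || value === null ? '' : String(value);
  // Quote the field if it contains a comma, a double quote, or a line break.
  if (/[",\r\n]/.test(text)) {
    return `"${text.replace(/"/g, '""')}"`;
  }
  return text;
}

function toCsvLine(fields) {
  return fields.map(escapeCsvField).join(',');
}

console.log(toCsvLine(['Jacket, blue', '$69.00', 'He said "hi"']));
// "Jacket, blue",$69.00,"He said ""hi"""
```

If a spreadsheet shows shifted columns when you open products.csv, unquoted commas in a field are a likely cause worth checking.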

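Step 5: Get More Data for the Top Products

In the final step, we refine the scraping process to collect more detail for the five highest-priced products. This means visiting each product's page and extracting the information we need, such as the product description and SKU code.

Create a getProductDetails function that fetches these additional details from each product's individual page.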



// Function to fetch additional product details from the product page
async function getProductDetails(url) {
   const html = await fetchHtml(url);
   if (!html) return { description: 'N/A', sku: 'N/A' };

   const $ = cheerio.load(html);
   return {
       description: $("div.woocommerce-Tabs-panel--description p").map((_, p) => $(p).text().trim()).get().join(' ') || 'N/A',
       sku: $(".product_meta .sku").text().trim() || 'N/A'
   };
}
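
getProductDetails passes the product's url straight to fetchHtml, which only works if that value is an absolute URL. In case the href values extracted from the listing turn out to be relative, the sketch below shows one way to resolve them against the listing page with the built-in URL class. The helper name toAbsoluteUrl and the example path are hypothetical.

```javascript
// Sketch: resolve a possibly relative href against the page it came from.
// Returns null when the inputs cannot be turned into a valid URL.
function toAbsoluteUrl(href, baseUrl) {
  try {
    return new URL(href, baseUrl).toString();
  } catch {
    return null;
  }
}

const base = 'https://www.scrapingcourse.com/button-click';
// Hypothetical relative path, for illustration only.
console.log(toAbsoluteUrl('/ecommerce/product/example-item/', base));
// https://www.scrapingcourse.com/ecommerce/product/example-item/
```

An href that is already absolute passes through unchanged, so applying the helper to every product URL is harmless.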

Next, update the exportProductsToCSV function so that it includes the new fields gathered for the five highest-priced products.



// Function to export products to a CSV file
async function exportProductsToCSV(products) {
   const csvWriter = csv({
       path: 'products.csv',
       header: [
           { id: 'title', title: 'Title' },
           { id: 'price', title: 'Price' },
           { id: 'image', title: 'Image URL' },
           { id: 'url', title: 'Product URL' },
           { id: 'description', title: 'Description' },
           { id: 'sku', title: 'SKU' }
       ]
   });

   await csvWriter.writeRecords(products);
   console.log('CSV file with additional product details has been created.');
}
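
Only the top five products receive description and sku values, so every other record lacks those keys when it reaches the writer. If you prefer those cells to show an explicit placeholder, a sketch like the following could normalize each record first. The helper name fillMissing, the COLUMNS list, and the 'N/A' placeholder are assumptions chosen to match the header ids above.

```javascript
// Column ids mirror the header definition used for the CSV writer.
const COLUMNS = ['title', 'price', 'image', 'url', 'description', 'sku'];

// Sketch: return a copy of the product with every column present,
// using a placeholder where a value is missing or empty.
function fillMissing(product, placeholder = 'N/A') {
  const row = {};
  for (const key of COLUMNS) {
    const value = product[key];
    const isEmpty = value === undefined || value === null || value === '';
    row[key] = isEmpty ? placeholder : value;
  }
  return row;
}

console.log(fillMissing({ title: 'Item A', price: '$10.00', url: '/a' }));
```

You could apply it with products.map((product) => fillMissing(product)) just before calling writeRecords.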

Finally, update the main function to fetch the additional details and export the enriched product data to a CSV file.



// Main function to test and execute all the logic
async function main() {
   const productsHtml = await fetchAllProducts();
   const products = parseProducts(productsHtml);

   // Sort products by price in descending order
   products.sort((a, b) => parseFloat(b.price.replace(/[^0-9.-]+/g, "")) - parseFloat(a.price.replace(/[^0-9.-]+/g, "")));

   // Fetch additional details for the top 5 highest-priced products
   for (let i = 0; i < Math.min(5, products.length); i++) {
       const details = await getProductDetails(products[i].url);
       products[i] = { ...products[i], ...details };
   }

   await exportProductsToCSV(products); // Export products to CSV (the function logs a confirmation)
}

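After running the code, you will get a CSV file listing all scraped products sorted by price from highest to lowest. The five highest-priced products also include extra details: the product description and SKU code.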
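
The sorting step depends on turning a price string into a number. As a small sketch of that logic in isolation, the helpers below parse a price and pick the top entries without mutating the original array. The names parsePrice and topByPrice are hypothetical, treating an unparseable price as 0 is an assumption, and the sample prices are made up.

```javascript
// Sketch: strip currency symbols and thousands separators, then parse.
// Assumption: prices use "." as the decimal separator.
function parsePrice(text) {
  const value = parseFloat(String(text).replace(/[^0-9.-]+/g, ''));
  return Number.isNaN(value) ? 0 : value; // unparseable prices sort last
}

// Sketch: return the `count` most expensive products as a new array.
function topByPrice(products, count) {
  return [...products]
    .sort((a, b) => parsePrice(b.price) - parsePrice(a.price))
    .slice(0, count);
}

// Hypothetical sample data for illustration only.
const priced = [
  { title: 'Item A', price: '$10.00' },
  { title: 'Item B', price: '$1,299.50' },
  { title: 'Item C', price: '$99.99' },
];
console.log(topByPrice(priced, 2).map((p) => p.title)); // [ 'Item B', 'Item C' ]
```

Copying the array before sorting keeps the original scrape order available if you need it later.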

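Note: the number of products scraped (48) comes from the limit set earlier in the fetchAllProducts function. Adjust that limit if you want to scrape more products before determining the top five.

Conclusion

By following these steps, you have built a web scraper that can handle dynamic pages with infinite scroll or a "Load more" button. The key to effective web scraping is understanding how the target website is structured and using the right tools to navigate it and get past anti-bot measures.

To develop your web scraping skills further, consider the following:

Use rotating proxies to avoid IP bans.
Explore techniques for handling CAPTCHA challenges.
Scrape more complex websites with additional AJAX calls or nested "Load more" buttons.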
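
As a starting point for the first suggestion, here is a minimal sketch of round-robin rotation: each call returns the next entry from a fixed list and wraps around at the end. The helper name createRotator and the proxy addresses are hypothetical, and wiring the chosen proxy into your HTTP client is left out because it depends on the client and proxy provider you use.

```javascript
// Sketch: cycle through a list, one item per call, wrapping around.
function createRotator(items) {
  if (!Array.isArray(items) || items.length === 0) {
    throw new Error('createRotator needs a non-empty array');
  }
  let index = 0;
  return () => {
    const item = items[index];
    index = (index + 1) % items.length;
    return item;
  };
}

// Hypothetical proxy addresses, for illustration only.
const nextProxy = createRotator([
  'http://proxy-a.example:8080',
  'http://proxy-b.example:8080',
]);
console.log(nextProxy()); // http://proxy-a.example:8080
console.log(nextProxy()); // http://proxy-b.example:8080
console.log(nextProxy()); // http://proxy-a.example:8080
```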

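This tutorial gives you a solid foundation for scraping dynamic content, and you can now apply the same principles to other web scraping projects.

Source: https://dev.to/karanrathod316/how-to-scrape-data-from-a-page-with-infinite-scroll-2enk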
