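Efficient API consumption for huge data in JavaScript

When working with APIs that handle large datasets, it is crucial to manage the data flow efficiently and to address challenges such as pagination, rate limits, and memory usage. This article walks through how to consume such APIs with JavaScript's native fetch function. We will cover the following topics: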

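Handling huge amounts of data: retrieving large datasets incrementally to avoid overwhelming the system.
Pagination: most APIs, including the Storyblok Content Delivery API, return data in pages. We will look at how to manage pagination for efficient data retrieval.
Rate limiting: APIs often impose rate limits to prevent abuse. We will see how to detect and handle these limits.
Retry-After mechanism: if the API responds with a 429 status code (Too Many Requests), we will implement a "Retry-After" mechanism, which indicates how long to wait before retrying, so that data fetching continues smoothly.
Concurrent requests: fetching multiple pages in parallel can speed up the process. We will use JavaScript's Promise.all() to send concurrent requests and improve performance.
Avoiding memory overload: handling large datasets requires careful memory management. We will process the data in chunks and use generators to keep operations memory-efficient.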

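We will explore these techniques using the Storyblok Content Delivery API and explain how to handle all of these factors in JavaScript with fetch. Let's dive into the code.

Things to keep in mind when using the Storyblok Content Delivery API

Before diving into the code, here are a few key features of the Storyblok API to consider: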

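CV parameter: the cv (Content Version) parameter retrieves cached content. The cv value is returned in the first request and should be passed in subsequent requests so that the same cached version of the content is fetched.
Pagination with page and per_page: use the page and per_page parameters to control the number of items returned in each request and to iterate through the result pages.
Total header: the total header of the first response indicates the total number of items available. It is essential for calculating how many pages of data need to be fetched.
Handling 429 (rate limit): Storyblok enforces rate limits; when you hit them, the API returns a 429 status code. Use the Retry-After response header (or a default value) to decide how long to wait before retrying the request.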
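
To make these parameters concrete, here is a minimal sketch of how the URL for one page could be assembled. The helper name buildPageUrl and the placeholder token are illustrative assumptions, not part of the Storyblok API; only the page, per_page, and cv query parameters come from the list above.

```javascript
// Hypothetical helper: builds the URL for one page of results.
function buildPageUrl(baseUrl, page, perPage, cv) {
  const url = new URL(baseUrl);
  url.searchParams.set("page", String(page));
  url.searchParams.set("per_page", String(perPage));
  // The cv value is only known after the first response, so it is optional.
  if (cv !== undefined && cv !== null) {
    url.searchParams.set("cv", String(cv));
  }
  return url.toString();
}

// Placeholder token and cv value, for illustration only.
const exampleUrl = buildPageUrl(
  "https://api.storyblok.com/v2/cdn/stories?token=YOUR_TOKEN&version=published",
  2,
  25,
  1700000000,
);
console.log(exampleUrl);
```

Using URL and URLSearchParams instead of string concatenation takes care of encoding and makes it harder to end up with a malformed query string.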

Sample code for handling large datasets with `fetch()` in JavaScript

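Here is how I implemented these concepts using the native fetch function in JavaScript.
Please note:

This snippet creates a new file named stories.json as an example. If the file already exists, it will be overwritten, so if you already have a file with that name in the working directory, change the name in the snippet.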
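The requests within a batch run in parallel, but Promise.all() resolves to results in the same order as its input array, so stories are yielded in page order even if the third page responds before the second. A page that still fails after all retries contributes an empty array, so the output can have gaps.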
I tested this code with Bun :)

import { writeFile, appendFile } from "fs/promises";

// Read access token from Environment
const STORYBLOK_ACCESS_TOKEN = process.env.STORYBLOK_ACCESS_TOKEN;
// Read the content version (draft or published) from Environment
const STORYBLOK_VERSION = process.env.STORYBLOK_VERSION;

/**
 * Fetch a single page of data from the API,
 * with retry logic for rate limits (HTTP 429).
 */
async function fetchPage(url, page, perPage, cv) {
  let retryCount = 0;
  // Max retry attempts
  const maxRetries = 5;
  while (retryCount <= maxRetries) {
    try {
      const response = await fetch(
        `${url}&page=${page}&per_page=${perPage}&cv=${cv}`,
      );
      // Handle 429 Too Many Requests (Rate Limit)
      if (response.status === 429) {
        // Some APIs provide a Retry-After value in the response headers.
        // Retry-After indicates how long to wait before retrying.
        // Storyblok uses a fixed window counter (1 second window)
        const retryAfter = Number(response.headers.get("Retry-After")) || 1;
        retryCount++;
        // In the case of a rate limit, waiting 1 second is usually enough.
        // If not, we wait 2 seconds on the second attempt, and so on,
        // in order to progressively slow down the retry requests.
        const waitSeconds = retryAfter * retryCount;
        console.log(
          `Rate limited on page ${page}. Retrying after ${waitSeconds} seconds...`,
        );
        // setTimeout accepts milliseconds, so we use 1000 as multiplier
        await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
        continue;
      }

      if (!response.ok) {
        throw new Error(
          `Failed to fetch page ${page}: HTTP ${response.status}`,
        );
      }
      const data = await response.json();
      // Return the stories data of the current page
      return data.stories || [];
    } catch (error) {
      console.error(`Error fetching page ${page}: ${error.message}`);
      return []; // Return an empty array if the request fails to not break the flow
    }
  }
  console.error(`Failed to fetch page ${page} after ${maxRetries} attempts`);
  return []; // If we hit the max retry limit, return an empty array
}

/**
 * Fetch all data in parallel, processing pages in batches,
 * as an async generator (the reason why we use the `*`)
 */
async function* fetchAllDataInParallel(
  url,
  perPage = 25,
  numOfParallelRequests = 5,
) {

  let currentPage = 1;
  let totalPages = null;

  // Fetch the first page to get:
  // - the total entries (the `total` HTTP header)
  // - the CV for caching (the `cv` attribute in the JSON response payload)
  const firstResponse = await fetch(
    `${url}&page=${currentPage}&per_page=${perPage}`,
  );
  if (!firstResponse.ok) {
    console.log(`${url}&page=${currentPage}&per_page=${perPage}`);
    console.log(firstResponse);
    throw new Error(`Failed to fetch data: HTTP ${firstResponse.status}`);
  }
  console.timeLog("API", "After first response");

  const firstData = await firstResponse.json();
  const total = parseInt(firstResponse.headers.get("total"), 10) || 0;
  totalPages = Math.ceil(total / perPage);

  // Yield the stories from the first page
  for (const story of firstData.stories) {
    yield story;
  }

  const cv = firstData.cv;

  console.log(`Total pages: ${totalPages}`);
  console.log(`CV parameter for caching: ${cv}`);

  currentPage++; // Start from the second page now

  while (currentPage <= totalPages) {
    // Get the list of pages to fetch in the current batch
    const pagesToFetch = [];
    for (
      let i = 0;
      i < numOfParallelRequests && currentPage <= totalPages;
      i++
    ) {
      pagesToFetch.push(currentPage);
      currentPage++;
    }

    // Fetch the pages in parallel
    const batchRequests = pagesToFetch.map((page) =>
      fetchPage(url, page, perPage, cv),
    );

    // Wait for all requests in the batch to complete
    const batchResults = await Promise.all(batchRequests);
    console.timeLog("API", `Got ${batchResults.length} response`);
    // Yield the stories from each batch of requests
    for (let result of batchResults) {
      for (const story of result) {
        yield story;
      }
    }
    console.log(`Fetched pages: ${pagesToFetch.join(", ")}`);
  }
}

console.time("API");
const apiUrl = `https://api.storyblok.com/v2/cdn/stories?token=${STORYBLOK_ACCESS_TOKEN}&version=${STORYBLOK_VERSION}`;
//const apiUrl = `http://localhost:3000?token=${STORYBLOK_ACCESS_TOKEN}&version=${STORYBLOK_VERSION}`;

const stories = fetchAllDataInParallel(apiUrl, 25,7);

// Create an empty file (or overwrite if it exists) before appending
await writeFile('stories.json', '[', 'utf8'); // Start the JSON array
let i = 0;
for await (const story of stories) {
  i++;
  console.log(story.name);
  // If it's not the first story, add a comma to separate JSON objects
  if (i > 1) {
    await appendFile('stories.json', ',', 'utf8');
  }
  // Append the current story to the file
  await appendFile('stories.json', JSON.stringify(story, null, 2), 'utf8');
}
// Close the JSON array in the file
await appendFile('stories.json', ']', 'utf8'); // End the JSON array
console.log(`Total Stories: ${i}`);

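Explanation of the key steps

Here is a breakdown of the key steps in the code that make API consumption with the Storyblok Content Delivery API efficient and reliable: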

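1) Fetching pages with a retry mechanism (fetchPage)

This function fetches a single page of data from the API. It includes logic for retrying when the API responds with a 429 (Too Many Requests) status, which signals that the rate limit has been exceeded.
The retryAfter value specifies how long to wait before retrying. I use setTimeout to pause before making the next request, the pause grows with each retry, and retries are limited to a maximum of 5.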
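
The waiting time can be isolated in a small pure function, which makes the progressive slowdown easier to reason about. This is a sketch only: the name retryDelayMs is hypothetical, and it assumes Retry-After carries a number of seconds (the header may also carry an HTTP date, which this sketch treats as missing).

```javascript
// Hypothetical helper: milliseconds to wait before retry number `attempt` (1-based).
function retryDelayMs(retryAfterHeader, attempt) {
  const seconds = Number(retryAfterHeader);
  // Fall back to 1 second when the header is missing or not a positive number.
  const base = Number.isFinite(seconds) && seconds > 0 ? seconds : 1;
  // Multiply by the attempt number to slow down progressively.
  return base * 1000 * attempt;
}

console.log(retryDelayMs(null, 1)); // 1000
console.log(retryDelayMs("2", 3)); // 6000
```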

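2) Initial page request and the CV parameter

The first API request is crucial because it retrieves the total response header (which indicates the total number of stories) and the cv value from the JSON response body (used for caching).
You can use the total header to calculate the number of pages required, while passing the cv parameter in later requests ensures that the cached content is used.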
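
The following sketch shows one way to pull both values out of the first response. The helper name readFirstPage and the simulated response are assumptions made so that the example runs without a network call; the total header and the cv and stories attributes are the ones used in the article's code.

```javascript
// Hypothetical helper: extracts what the later requests need from the first response.
async function readFirstPage(response, perPage) {
  // A missing or invalid header is treated as zero items.
  const total = parseInt(response.headers.get("total"), 10) || 0;
  const data = await response.json();
  return {
    stories: data.stories || [],
    cv: data.cv,
    totalPages: Math.ceil(total / perPage),
  };
}

// Simulated response with made-up values, standing in for the real API.
const fakeResponse = new Response(
  JSON.stringify({ stories: [{ name: "Home" }], cv: 1700000000 }),
  { headers: { total: "60" } },
);

readFirstPage(fakeResponse, 25).then((info) => {
  console.log(info.totalPages, info.cv); // 3 1700000000
});
```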

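3) Handling pagination

Pagination is managed with the page and per_page query parameters. The code requests 25 stories per page (a value you can adjust), and the total header is used to calculate how many pages need to be fetched.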
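
As a rough illustration of the page arithmetic, the remaining pages can be grouped into batches up front. The helper planBatches is hypothetical; fetchAllDataInParallel builds its batches inside a loop instead, but the resulting page groups are the same.

```javascript
// Hypothetical helper: groups the remaining page numbers into batches.
// Page 1 is assumed to have been fetched already by the initial request.
function planBatches(totalPages, batchSize, firstPage = 2) {
  const batches = [];
  for (let start = firstPage; start <= totalPages; start += batchSize) {
    const batch = [];
    for (let page = start; page < start + batchSize && page <= totalPages; page++) {
      batch.push(page);
    }
    batches.push(batch);
  }
  return batches;
}

console.log(planBatches(8, 3)); // [[2, 3, 4], [5, 6, 7], [8]]
```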

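4) Concurrent requests with Promise.all():

To speed up the process, multiple pages are fetched in parallel using JavaScript's Promise.all(). This method sends several requests at the same time and waits for all of them to complete.
After each batch of parallel requests completes, the results are processed to yield the stories. This avoids loading all the data into memory at once and reduces memory consumption.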
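
A small simulation helps to see how a batch behaves. The function fakeFetchPage is a stand-in invented for this sketch (it resolves after a delay instead of calling the API), so the delays and story names are made up.

```javascript
// Stand-in for a real page request: resolves after `ms` milliseconds.
function fakeFetchPage(page, ms) {
  return new Promise((resolve) =>
    setTimeout(() => resolve([`story from page ${page}`]), ms),
  );
}

async function fetchBatch() {
  // All three requests start immediately; page 3 finishes first,
  // yet Promise.all keeps the results in the order of the input array.
  const results = await Promise.all([
    fakeFetchPage(2, 30),
    fakeFetchPage(3, 10),
    fakeFetchPage(4, 20),
  ]);
  return results.flat();
}

fetchBatch().then((stories) => console.log(stories));
// ["story from page 2", "story from page 3", "story from page 4"]
```

Note that Promise.all rejects as soon as one promise rejects. The article's fetchPage avoids this by catching errors and returning an empty array.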

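5) Memory management with asynchronous iteration (for await...of):

Instead of collecting all the data into an array, we use a JavaScript async generator (async function* together with for await...of) to process each story as it is fetched. This prevents memory overload when handling large datasets.
By yielding the stories one at a time, the code stays efficient and holds at most one batch of pages in memory.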
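
The pattern can be reduced to a few lines. In this sketch, loadPage and its items are simulated placeholders; the point is only that the consumer receives one item at a time and never builds the full list.

```javascript
// Simulated paged source with made-up items (no network involved).
async function loadPage(page) {
  return [`item ${page}-a`, `item ${page}-b`];
}

// Yields items one by one; only the current page is held in memory.
async function* iterateItems(totalPages) {
  for (let page = 1; page <= totalPages; page++) {
    const items = await loadPage(page);
    for (const item of items) {
      yield item;
    }
  }
}

async function countItems() {
  let count = 0;
  for await (const item of iterateItems(3)) {
    // Process `item` here, for example by appending it to a file.
    count++;
  }
  return count;
}

countItems().then((count) => console.log(count)); // 6
```

Because the generator is lazy, the next page is requested only when the consumer asks for more items, which is what keeps memory usage bounded.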

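6) Rate limit handling:

If the API responds with a 429 status code (rate limited), the script uses the retryAfter value, pauses for the indicated time, and then retries the request. This keeps the script within the API rate limits and avoids sending too many requests in a short period.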
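
The retry loop can also be tried out without hitting a real API. In the sketch below, withRateLimitRetry, fakeRequest, and fakeSleep are hypothetical names; the request and the sleep function are injected so that the example runs instantly. Unlike the article's fetchPage, this variant throws when the retries are exhausted instead of returning an empty array.

```javascript
// Hypothetical wrapper: retries while the response status is 429.
async function withRateLimitRetry(doRequest, sleep, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await doRequest();
    if (response.status !== 429) {
      return response;
    }
    // Fall back to 1 second when Retry-After is missing or not numeric.
    const retryAfter = Number(response.headers.get("Retry-After")) || 1;
    await sleep(retryAfter * 1000 * (attempt + 1));
  }
  throw new Error(`Still rate limited after ${maxRetries} retries`);
}

// Simulated API: two 429 responses, then a 200.
const statuses = [429, 429, 200];
const fakeRequest = async () => new Response(null, { status: statuses.shift() });
// Records the requested waits instead of actually sleeping.
const waits = [];
const fakeSleep = async (ms) => {
  waits.push(ms);
};

withRateLimitRetry(fakeRequest, fakeSleep).then((response) => {
  console.log(response.status, waits); // 200 [1000, 2000]
});
```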

Conclusion

This article covered the key considerations when consuming APIs in JavaScript with the native fetch function. I tried to handle:

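Large datasets: fetching large datasets using pagination.
Pagination: managing pagination with the page and per_page parameters.
Rate limits and retry mechanism: handling rate limits and retrying requests after an appropriate delay.
Concurrent requests: fetching pages in parallel with JavaScript's Promise.all() to speed up data retrieval.
Memory management: using JavaScript generators (function* and for await...of) to process data without consuming excessive memory.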

By applying these techniques, you can handle API consumption in a scalable, efficient, and memory-safe way.

Feel free to drop your comments/feedback.

References

Source: https://dev.to/robertobutti/efficient-api-conspiration-for-huge-data-in-javascript-1i72
