How to Extract Web Data with Large Language Models: A Hands-On Guide to Lightfeed Extractor (How It Works, Step-by-Step Usage, FAQ, and Optimization Tips)

Robust Web Data Extractor Using LLMs

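Overview

Lightfeed Extractor is a TypeScript library for robust web data extraction using large language models (LLMs). You describe what you want in a natural language prompt, and it extracts structured data from HTML, Markdown, or plain text. The library is designed to return complete, accurate results with high token efficiency, which matters for production data pipelines.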

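Features

🧹 LLM-ready Markdown - Converts HTML to Markdown suited to LLM processing, with options to extract only the main content and to clean URLs by removing tracking parameters.
⚡️ LLM extraction - Uses the LLM in JSON mode to extract structured data according to an input Zod schema. Includes token usage limits and tracking.
🛠️ JSON recovery - Sanitizes and recovers failed JSON output. This makes extraction more robust for complex schemas, especially deeply nested objects and arrays.
🔗 URL validation - Handles relative URLs, removes invalid links, and repairs links broken by Markdown escaping.
🤖 Works with Playwright - Load the page with Playwright, then extract structured data from the HTML content.
🧭 AI browser navigation - Pair with @lightfeed/browser-agent to navigate pages with natural language commands before extracting structured data.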
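
To make the URL handling in the feature list concrete, here is a minimal sketch of resolving a relative link against the page URL and stripping tracking parameters. It relies only on the standard URL class. The function name normalizeLink and the TRACKING_PARAMS list are assumptions made for illustration; they are not the library's implementation or its actual parameter list.

```typescript
// Hypothetical list of tracking parameters; the library's own list may differ.
const TRACKING_PARAMS = [
  "utm_source",
  "utm_medium",
  "utm_campaign",
  "utm_term",
  "utm_content",
  "gclid",
  "fbclid",
];

// Resolve a possibly relative link against the page URL and drop tracking
// parameters. Returns null when the link is not a usable http(s) URL.
function normalizeLink(href: string, sourceUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(href, sourceUrl);
  } catch {
    return null; // could not be parsed even relative to the page URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null; // e.g. javascript: or mailto: links
  }
  for (const param of TRACKING_PARAMS) {
    url.searchParams.delete(param);
  }
  return url.toString();
}

console.log(
  normalizeLink(
    "/en/ip/item/123?utm_source=mail&size=l",
    "https://www.example.com/en/browse"
  )
);
// prints "https://www.example.com/en/ip/item/123?size=l"
```

Returning null instead of throwing lets a caller filter out unusable links in one pass, which roughly matches the "removes invalid links" behaviour described above.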

Installation

Install the extractor together with @langchain/core and the LLM provider of your choice:

npm install @lightfeed/extractor @langchain/core

Then add your LLM provider (we use LangChain for interoperability):

npm install @langchain/openai         # OpenAI
npm install @langchain/google-genai   # Google Gemini
npm install @langchain/anthropic      # Anthropic
npm install @langchain/ollama         # Ollama (local models)

Usage

E-commerce Product Extraction

This example shows how to load a page with Playwright and use the extractor to pull structured product data from an e-commerce site.

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { chromium } from "playwright";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z
    .array(
      z.object({
        name: z.string().describe("Product name or title"),
        brand: z.string().optional().describe("Brand name"),
        price: z.number().describe("Current price"),
        originalPrice: z
          .number()
          .optional()
          .describe("Original price if on sale"),
        rating: z.number().optional().describe("Product rating out of 5"),
        reviewCount: z.number().optional().describe("Number of reviews"),
        productUrl: z.string().url().describe("Link to product detail page"),
        imageUrl: z.string().url().optional().describe("Product image URL"),
      })
    )
    .describe("List of bread and bakery products"),
});

const browser = await chromium.launch();
const page = await browser.newPage();

const pageUrl = "https://www.walmart.ca/en/browse/grocery/bread-bakery/10019_6000194327359";
await page.goto(pageUrl);

try {
  await page.waitForLoadState("networkidle", { timeout: 10000 });
} catch {
  console.log("Network idle timeout, continuing...");
}

const html = await page.content();
await browser.close();

// Extract structured product data
const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    apiKey: process.env.GOOGLE_API_KEY,
    model: "gemini-2.5-flash",
    temperature: 0,
  }),
  content: html,
  format: ContentFormat.HTML,
  sourceUrl: pageUrl,
  schema: productCatalogSchema,
  htmlExtractionOptions: {
    extractMainHtml: true,
    includeImages: true,
    cleanUrls: true
  }
});

console.log("Found products:", result.data.products.length);
console.log(JSON.stringify(result.data, null, 2));

/* Expected output:
{
  "products": [
    {
      "name": "Dempster's® Signature The Classic Burger Buns, Pack of 8; 568 g",
      "brand": "Dempster's",
      "price": 3.98,
      "originalPrice": 4.57,
      "rating": 4.7376,
      "reviewCount": 141,
      "productUrl": "https://www.walmart.ca/en/ip/dempsters-signature-the-classic-burger-buns/6000188080451?classType=REGULAR&athbdg=L1300",
      "imageUrl": "https://i5.walmartimages.ca/images/Enlarge/725/979/6000196725979.jpg?odnHeight=580&odnWidth=580&odnBg=FFFFFF"
    },
    ... (more products)
  ]
}
*/
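
Once the data is extracted, it is plain typed data that can be processed like any other object. As a small illustration, the sketch below computes the discount for products that carry an originalPrice. The Product interface repeats only the fields it needs from productCatalogSchema; the helper name discountPercent and the second sample item are hypothetical.

```typescript
// Subset of the fields defined in productCatalogSchema.
interface Product {
  name: string;
  price: number;
  originalPrice?: number;
}

// Percentage discount rounded to one decimal place,
// or null when the product is not on sale.
function discountPercent(product: Product): number | null {
  const { price, originalPrice } = product;
  if (
    originalPrice === undefined ||
    originalPrice <= 0 ||
    price >= originalPrice
  ) {
    return null;
  }
  return Math.round(((originalPrice - price) / originalPrice) * 1000) / 10;
}

const sample: Product[] = [
  // Prices taken from the expected output shown earlier.
  { name: "Dempster's Burger Buns", price: 3.98, originalPrice: 4.57 },
  // Hypothetical item with no sale price.
  { name: "Plain Bagels", price: 2.5 },
];

for (const product of sample) {
  console.log(product.name, discountPercent(product));
}
// prints "Dempster's Burger Buns 12.9" then "Plain Bagels null"
```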

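Using with Browser Agent

For pages that need interaction before extraction (searching, clicking through pagination, dismissing pop-ups, and so on), you can pair this library with @lightfeed/browser-agent. The browser agent uses AI to navigate pages from natural language commands, and this library then extracts structured data from the result.

Install both packages:

npm install @lightfeed/extractor @lightfeed/browser-agent

Then navigate with the browser agent and extract structured data with the extractor: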

import { BrowserAgent } from "@lightfeed/browser-agent";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

const schema = z.object({
  products: z.array(
    z.object({
      name: z.string(),
      price: z.number(),
      rating: z.number().optional(),
      productUrl: z.string().url(),
    })
  ),
});

// 1. Use browser agent to navigate with AI
const agent = new BrowserAgent({ browserProvider: "Local" });
const page = await agent.newPage();
await page.goto("https://amazon.com");
await page.ai("Search for 'org

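FAQ

How does Lightfeed Extractor keep web data extraction accurate?

The library runs the LLM in JSON mode and extracts structured data against a Zod schema. It also includes JSON recovery, which sanitizes and repairs failed output, so extraction of complex nested data is more robust and less likely to be lost to a malformed response.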
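
As a rough illustration of what JSON recovery can involve, the sketch below applies a few common repairs to malformed LLM output: removing a wrapping Markdown code fence, trimming text around the JSON payload, and dropping trailing commas. This is a generic heuristic written for this article, not the recovery logic that Lightfeed Extractor actually uses; the name parseLenientJson is hypothetical.

```typescript
// Three backticks, built at runtime so the marker does not appear literally here.
const FENCE = "`".repeat(3);

// Parse LLM output as JSON, applying simple repairs when strict parsing fails.
function parseLenientJson(raw: string): unknown {
  let text = raw.trim();

  // Remove a wrapping Markdown code fence (optionally tagged "json").
  if (text.startsWith(FENCE)) {
    text = text.slice(FENCE.length).replace(/^json\s*/i, "");
    const end = text.lastIndexOf(FENCE);
    if (end !== -1) text = text.slice(0, end);
  }

  // Keep only the span from the first opening bracket to the last closing one.
  const start = text.search(/[{[]/);
  const stop = Math.max(text.lastIndexOf("}"), text.lastIndexOf("]"));
  if (start === -1 || stop < start) {
    throw new SyntaxError("No JSON object or array found");
  }
  text = text.slice(start, stop + 1);

  try {
    return JSON.parse(text);
  } catch {
    // Naive repair: drop trailing commas before a closing bracket.
    // Caveat: this can corrupt string values that contain ",}" or ",]".
    return JSON.parse(text.replace(/,\s*([}\]])/g, "$1"));
  }
}

// Fenced output with trailing commas, as a model might produce it.
const messy =
  FENCE + 'json\n{"products": [{"name": "Bagels", "price": 2.5,},]}\n' + FENCE;
console.log(JSON.stringify(parseLenientJson(messy)));
// prints {"products":[{"name":"Bagels","price":2.5}]}
```

A recovered value should still be validated against the schema, since a repair can yield JSON that parses but does not match the expected shape.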

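Can Lightfeed Extractor scrape product information from e-commerce sites?

Yes. The documentation includes a dedicated e-commerce product extraction example: load the page with Playwright, then extract structured product data from the HTML using a natural language prompt. The approach is suitable for production use.

Which dependencies should I watch for when installing Lightfeed Extractor?

You must install @langchain/core explicitly as a peer dependency to avoid version conflicts. You also need @lightfeed/extractor itself and the package for your chosen LLM provider (for example, @langchain/openai).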
