How to Extract Web Data with Large Language Models: A Hands-On Guide to Lightfeed Extractor

Robust Web Data Extractor Using LLMs

Overview

Lightfeed Extractor is a TypeScript library built for robust web data extraction using large language models (LLMs). You write natural language prompts to extract structured data from HTML, Markdown, or plain text. The library aims for complete, accurate results with high token efficiency, which is critical for production data pipelines.

Features

🧹 LLM-ready Markdown - Convert HTML to LLM-ready Markdown, with options to extract only the main content and to clean URLs by removing tracking parameters.
⚡️ LLM Extraction - Use LLMs in JSON mode to extract structured data according to an input Zod schema. Token usage limits and tracking are included.
🛠️ JSON Recovery - Sanitize and recover failed JSON output. This makes extraction with complex schemas much more robust, especially with deeply nested objects and arrays.
🔗 URL Validation - Handle relative URLs, remove invalid ones, and repair Markdown-escaped links.
🤖 Works with Playwright - Use Playwright to load pages, then extract structured data from the HTML content.
🧭 AI Browser Navigation - Pair with @lightfeed/browser-agent to navigate pages using natural language commands before extracting structured data.
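
As a rough illustration of what URL cleaning and validation involve, the sketch below resolves a relative link against the page URL, rejects non-HTTP links, removes common tracking parameters, and undoes Markdown escaping. The normalizeUrl helper and the TRACKING_PARAMS list are hypothetical names chosen for this example. This is not the library's internal implementation, which may apply different rules.

```typescript
// Hypothetical helper for illustration; not part of @lightfeed/extractor's API.
// Assumed list of tracking parameters; a real cleaner may cover more.
const TRACKING_PARAMS: RegExp[] = [/^utm_/i, /^fbclid$/i, /^gclid$/i];

function normalizeUrl(raw: string, sourceUrl: string): string | null {
  // Undo Markdown escaping such as "\_" or "\(" that a converter may have added.
  const unescaped = raw.replace(/\\([_*()\[\]~&])/g, "$1");

  let url: URL;
  try {
    // The URL constructor resolves relative paths against the base URL.
    url = new URL(unescaped, sourceUrl);
  } catch {
    return null; // Not parseable: treat as invalid.
  }

  // Keep only web links; drop javascript:, mailto:, and similar schemes.
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;

  // Collect keys first so parameters are not deleted while iterating.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    if (TRACKING_PARAMS.some((re) => re.test(key))) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}
```

Calling normalizeUrl on each link found in a page, with the page URL as sourceUrl, yields either an absolute, cleaned URL or null for links that should be discarded.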

Tip
Building retail competitor intelligence at scale? Go to lightfeed.ai, our full platform for tracking competitor pricing, sales, promotions, and SEO.

Installation

Install the extractor along with @langchain/core:

npm install @lightfeed/extractor @langchain/core

Then add your chosen LLM provider (the library uses LangChain for interoperability):

npm install @langchain/openai         # OpenAI
npm install @langchain/google-genai   # Google Gemini
npm install @langchain/anthropic      # Anthropic
npm install @langchain/ollama         # Ollama (local models)

Important
@langchain/core is a required peer dependency shared by this library and all @langchain/* providers. Always install it explicitly to avoid version conflicts.

Usage

E-commerce Product Extraction

This example uses Playwright to load a page on an e-commerce website, then uses the extractor to pull structured product data from the HTML.

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { chromium } from "playwright";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z
    .array(
      z.object({
        name: z.string().describe("Product name or title"),
        brand: z.string().optional().describe("Brand name"),
        price: z.number().describe("Current price"),
        originalPrice: z
          .number()
          .optional()
          .describe("Original price if on sale"),
        rating: z.number().optional().describe("Product rating out of 5"),
        reviewCount: z.number().optional().describe("Number of reviews"),
        productUrl: z.string().url().describe("Link to product detail page"),
        imageUrl: z.string().url().optional().describe("Product image URL"),
      })
    )
    .describe("List of bread and bakery products"),
});

const browser = await chromium.launch();
const page = await browser.newPage();

const pageUrl = "https://www.walmart.ca/en/browse/grocery/bread-bakery/10019_6000194327359";
await page.goto(pageUrl);

try {
  await page.waitForLoadState("networkidle", { timeout: 10000 });
} catch {
  console.log("Network idle timeout, continuing...");
}

const html = await page.content();
await browser.close();

// Extract structured product data
const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    apiKey: process.env.GOOGLE_API_KEY,
    model: "gemini-2.5-flash",
    temperature: 0,
  }),
  content: html,
  format: ContentFormat.HTML,
  sourceUrl: pageUrl,
  schema: productCatalogSchema,
  htmlExtractionOptions: {
    extractMainHtml: true,
    includeImages: true,
    cleanUrls: true
  }
});

console.log("Found products:", result.data.products.length);
console.log(JSON.stringify(result.data, null, 2));

/* Expected output:
{
  "products": [
    {
      "name": "Dempster's® Signature The Classic Burger Buns, Pack of 8; 568 g",
      "brand": "Dempster's",
      "price": 3.98,
      "originalPrice": 4.57,
      "rating": 4.7376,
      "reviewCount": 141,
      "productUrl": "https://www.walmart.ca/en/ip/dempsters-signature-the-classic-burger-buns/6000188080451?classType=REGULAR&athbdg=L1300",
      "imageUrl": "https://i5.walmartimages.ca/images/Enlarge/725/979/6000196725979.jpg?odnHeight=580&odnWidth=580&odnBg=FFFFFF"
    },
    ... (more products)
  ]
}
*/

Using with Browser Agent

For pages that require interaction before extraction (searching, clicking through pagination, dismissing popups, and so on), you can pair this library with @lightfeed/browser-agent. The browser agent uses AI to navigate pages via natural language commands, and this library extracts structured data from the result.

Install both packages:

npm install @lightfeed/extractor @lightfeed/browser-agent

Then use the browser agent to navigate and the extractor to pull structured data:

import { BrowserAgent } from "@lightfeed/browser-agent";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

const schema = z.object({
  products: z.array(
    z.object({
      name: z.string(),
      price: z.number(),
      rating: z.number().optional(),
      productUrl: z.string().url(),
    })
  ),
});

// 1. Use browser agent to navigate with AI
const agent = new BrowserAgent({ browserProvider: "Local" });
const page = await agent.newPage();
await page.goto("https://amazon.com");
await page.ai("Search for 'org

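Frequently Asked Questions (FAQ)

How does Lightfeed Extractor keep extracted web data accurate?

The library runs the LLM in JSON mode and extracts structured data according to a Zod schema. It also includes JSON recovery, which sanitizes and repairs failed output, making the extraction of complex nested data more robust and accurate.

Can Lightfeed Extractor scrape product information from e-commerce sites?

Yes. It provides a dedicated e-commerce product extraction use case: Playwright loads the page, and natural language prompts are then used to extract structured product data from the HTML. It is intended for production use.

What dependencies should I watch for when installing Lightfeed Extractor?

You must explicitly install @langchain/core as a peer dependency to avoid version conflicts. You also need @lightfeed/extractor and the package for your chosen LLM provider (for example, @langchain/openai).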
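
To make the idea of JSON recovery concrete, here is a minimal sketch of one possible strategy: parse the model's output, then keep the array items that match the expected shape and drop the ones that do not, rather than rejecting the whole response. The Product type, isProduct, and recoverProducts are hypothetical names for this illustration, and a hand-written type guard stands in for a Zod schema. The library's actual recovery logic is not shown in this guide and may work differently.

```typescript
// Hypothetical sketch; not the library's actual recovery implementation.
type Product = { name: string; price: number };

// Hand-written type guard standing in for schema validation.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.price === "number";
}

function recoverProducts(raw: string): Product[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // Output is not valid JSON at all: nothing to salvage here.
  }
  const items = (parsed as { products?: unknown } | null)?.products;
  if (!Array.isArray(items)) return [];
  // Keep valid items instead of failing the whole extraction on one bad entry.
  return items.filter(isProduct);
}
```

With this approach, a response in which one product has a non-numeric price still yields the remaining well-formed products.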
