How to Extract Structured Information from Unstructured Text: A 2024 Guide to the LangExtract Library

Introduction

LangExtract is a Python library that uses Large Language Models (LLMs) to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.

Why LangExtract?

LangExtract addresses the core challenges of extracting information from complex documents with a distinctive set of capabilities:

Precise Source Grounding: Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.

Reliable Structured Outputs: Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini to guarantee robust, structured results.

Optimized for Long Documents: Overcomes the "needle-in-a-haystack" challenge of large document extraction by using an optimized strategy of text chunking, parallel processing, and multiple passes for higher recall.

Interactive Visualization: Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in their original context.

Flexible LLM Support: Supports your preferred models, from cloud-based LLMs like the Google Gemini family to local open-source models via the built-in Ollama interface.

Adaptable to Any Domain: Define extraction tasks for any domain using just a few examples. LangExtract adapts to your needs without requiring any model fine-tuning.

Leverages LLM World Knowledge: Use precise prompt wording and few-shot examples to influence how the extraction task draws on LLM knowledge. The accuracy of any inferred information and its adherence to the task specification depend on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.
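
To make the idea of source grounding concrete, here is a minimal sketch in plain Python. It is not LangExtract's implementation: the helpers `ground` and `highlight` are hypothetical names, the lookup is a simple first-occurrence search, and the spans are assumed not to overlap.

```python
def ground(text, snippets):
    """Map each snippet to its (start, end) character offsets in text, or None."""
    spans = {}
    for snippet in snippets:
        start = text.find(snippet)  # first occurrence only (simplifying assumption)
        spans[snippet] = (start, start + len(snippet)) if start != -1 else None
    return spans


def highlight(text, spans):
    """Wrap every grounded span in square brackets, working right to left."""
    located = sorted((s for s in spans.values() if s), reverse=True)
    for start, end in located:
        # Editing from the right keeps the earlier offsets valid.
        text = text[:start] + "[" + text[start:end] + "]" + text[end:]
    return text


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
spans = ground(source, ["Lady Juliet", "Romeo", "Paris"])
print(spans)
print(highlight(source, spans))
```

A snippet that does not occur in the source (here "Paris") maps to `None`, which is the kind of case a grounding step needs to surface rather than hide.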

Quick Start

Note: Using cloud-hosted models like Gemini requires an API key. See the API Key Setup section for instructions on how to get and configure your key.

Extract structured information with just a few lines of code.

1. Define Your Extraction Task

First, create a prompt that clearly describes what you want to extract. Then provide a high-quality example to guide the model.

import langextract as lx
import textwrap

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

Note: Examples drive model behavior. Each extraction_text should ideally be verbatim from the example's text (no paraphrasing), listed in order of appearance. LangExtract raises Prompt alignment warnings by default if examples don't follow this pattern. Resolve these for best results.
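
The verbatim, in-order rule can also be checked by hand before calling the library. The following is a minimal sketch in plain Python, not LangExtract's own alignment check; the helper name `check_example_alignment` and its list-of-strings input are assumptions made for illustration.

```python
def check_example_alignment(text, extraction_texts):
    """Return a list of problems; an empty list means the example looks aligned.

    Hypothetical helper: each extraction string must occur verbatim in `text`,
    and the strings must appear in order without overlapping.
    """
    problems = []
    cursor = 0  # each match must start at or after this position
    for snippet in extraction_texts:
        position = text.find(snippet, cursor)
        if position == -1:
            if snippet in text:
                problems.append(f"out of order or overlapping: {snippet!r}")
            else:
                problems.append(f"not verbatim: {snippet!r}")
            continue
        cursor = position + len(snippet)
    return problems


example_text = (
    "ROMEO. But soft! What light through yonder window breaks? "
    "It is the east, and Juliet is the sun."
)
aligned = check_example_alignment(
    example_text, ["ROMEO", "But soft!", "Juliet is the sun"]
)
misaligned = check_example_alignment(
    example_text, ["Juliet is the sun", "Romeo marvels"]
)
print(aligned)
print(misaligned)
```

The first call mirrors the example above and reports no problems; the second includes a paraphrase ("Romeo marvels") that never appears in the text and is flagged.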

2. Run the Extraction

Provide your input text and the prompt materials to the lx.extract function.

# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

Model Selection: gemini-2.5-flash is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, gemini-2.5-pro may provide superior results. For large-scale or production use, a Tier 2 Gemini quota is suggested to increase throughput and avoid rate limits. See the rate-limit documentation for details.

Model Lifecycle: Note that Gemini models have a lifecycle with defined retirement dates. Users should consult the official model version documentation to stay informed about the latest stable and legacy versions.

3. Visualize the Results

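The extractions can be saved to a .jsonl file, a popular format for working with language model data. LangExtract can then generate an interactive HTML visualization from this file so you can review the entities in context.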

# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)

This creates an animated, interactive HTML file.
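
Because a .jsonl file stores one JSON object per line, a saved file can also be inspected with the standard library alone. The sketch below assumes nothing about LangExtract's actual output schema: the field names `text` and `extractions` and the file name `sample.jsonl` are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records; the real field names in LangExtract's output may differ.
records = [
    {"text": "Lady Juliet gazed longingly at the stars", "extractions": ["Lady Juliet"]},
    {"text": "her heart aching for Romeo", "extractions": ["Romeo"]},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["extractions"])
```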

Note on LLM Knowledge Utilization: This example demonstrates extractions that stay close to the text evidence: extracting "longing" for Lady Juliet's emotional state and identifying "yearning" from "gazed longingly at the stars." The task could be modified to generate attributes that draw more heavily on the LLM's world knowledge (e.g., adding "identity": "Capulet family daughter" or "literary_context": "tragic heroine"). The balance between text evidence and knowledge inference is controlled by your prompt instructions and example attributes.

Scaling to Longer Documents

For larger texts, you can process entire documents directly from URLs with parallel processing and enhanced sensitivity:

# Process Romeo and Juliet directly from Project Gutenberg
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Improves recall through multiple passes
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000    # Smaller contexts for better accuracy
)

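This approach can extract hundreds of entities from full novels while maintaining high accuracy. The interactive visualization handles large result sets seamlessly, making it easy to explore hundreds of entities from the output JSONL file. See the full Romeo and Juliet extraction example for detailed results and performance insights.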
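
The interplay of chunking, parallel workers, and multiple passes can be pictured with a toy sketch. This is not how LangExtract is implemented: `toy_extract` is a stand-in for an LLM call that simply treats capitalized words as entities, and the helper names and default values are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into consecutive chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def toy_extract(chunk, pass_index):
    """Stand-in for an LLM call: returns capitalized words as 'entities'.

    pass_index is unused here; a real model may return different results
    on each pass, which is what makes extra passes useful for recall.
    """
    return {word.strip(".,!?") for word in chunk.split() if word[:1].isupper()}


def extract_long(text, max_chars=40, passes=2, workers=4):
    """Run every chunk through every pass in parallel and merge the results."""
    chunks = chunk_text(text, max_chars)
    found = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pass_index in range(passes):
            results = pool.map(toy_extract, chunks, [pass_index] * len(chunks))
            for entities in results:
                found |= entities  # union across chunks and passes
    return found


sample = "Juliet waits at the window. Romeo climbs the wall. Tybalt watches Romeo."
print(sorted(extract_long(sample)))
```

Fixed-width chunking like this can cut an entity in half at a chunk boundary, which is one reason a production chunker would prefer sentence or token boundaries.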

Vertex AI Batch Processing: Save costs on large-scale tasks by enabling the Vertex AI Batch API: language_model_params={"vertexai": True, "batch": {"enabled": True}}. See the Vertex AI Batch API usage example for details.

Installation

Install from PyPI

pip install langextract

Recommended for most users. For isolated environments, consider using a virtual environment:

python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: langextract_env\Scripts\activate
pip install langextract

Install from Source

LangExtract uses modern Python packaging with pyproject.toml for dependency management. Installing with -e puts the package in development mode, allowing you to modify the code without reinstalling:

git clone https://github.com/google/langextract.git
cd langextract

# Basic installation:
pip install -e .

# Development installation (includes linting tools):
pip install -e ".[dev]"

# Testing installation (includes pytest):
pip install -e ".[test]"

Docker

docker build -t langextract .
docker run --rm -e LANGEXTRACT_API_KEY="your-api-key" langextract python your_script.py

API Key Setup for Cloud Models

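When using LangExtract with cloud-hosted models (such as Gemini or OpenAI), you need to set up an API key. On-device models do not require an API key. For developers using local LLMs, LangExtract offers built-in support for Ollama and can be extended to other third-party APIs by updating the inference endpoints.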

API Key Sources

Get API keys from:

AI Studio: for Gemini models

Vertex AI: for enterprise use

OpenAI Platform: for OpenAI models

Setting Up the API Key in Your Environment

Option 1: Environment Variable

export LANGEXTRACT_API_KEY="your-api-key-here"
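
Before running an extraction, it can help to confirm that the variable is actually visible to the Python process. A minimal sketch using only the standard library follows; the helper `require_api_key` is a hypothetical name and not part of LangExtract, and the placeholder value is set in code purely so the demonstration runs.

```python
import os


def require_api_key(var_name="LANGEXTRACT_API_KEY"):
    """Return the key stored in the given environment variable, or raise."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set in this process environment")
    return key


# Demonstration only: set a placeholder so the lookup succeeds.
os.environ["LANGEXTRACT_API_KEY"] = "your-api-key-here"
print("key loaded:", bool(require_api_key()))
```

Failing fast here gives a clearer error than a rejected request later in the run.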

Option 2: `.env` File (Recommended)

Add your API key to a .env file:

# Add the API key to a .env file
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF

# Keep your API key secure
echo '.env' >> .gitignore

In your Python code:

import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash"
)

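Option 3: Direct API Key (Not Recommended for Production)

You can also provide the API key directly in your code, though this is not recommended for production use: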

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
    api_key="your-api-key-here"  # Only for testing/development
)

Option 4: Vertex AI (Service Accounts)

Use Vertex AI to authenticate with service accounts:

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "location": "global"  # or a regional endpoint
    }
)

Adding Custom Model Providers

LangExtract supports custom LLM providers via a lightweight plugin system. You can add support for new models without changing core code.

Add new model support independently of the core library

Distribute your provider as a separate Python package

Keep custom dependencies isolated

Override or extend built-in providers via priority-based resolution

See the detailed guide in the Provider System Documentation to learn how to:

Register a provider with @registry.register(...)

Publish an entry point for discovery

Optionally provide a schema with get_schema_class() for structured output

Integrate with the factory via create_model(...)
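
The general pattern behind decorator-based registration with priority-based resolution can be sketched in a few lines. This is a generic illustration and not LangExtract's registry: the functions `register` and `resolve`, their signatures, and the provider classes are all hypothetical. Consult the provider system documentation for the real API.

```python
import re

# Hypothetical registry: a list of (priority, compiled pattern, provider class).
_PROVIDERS = []


def register(pattern, priority=0):
    """Class decorator that registers a provider for model IDs matching pattern."""
    def decorator(cls):
        _PROVIDERS.append((priority, re.compile(pattern), cls))
        return cls
    return decorator


def resolve(model_id):
    """Return the matching provider class with the highest priority."""
    matches = [(p, cls) for p, rx, cls in _PROVIDERS if rx.match(model_id)]
    if not matches:
        raise ValueError(f"no provider registered for {model_id!r}")
    return max(matches, key=lambda item: item[0])[1]


@register(r"^gemini", priority=0)
class BuiltinGeminiProvider:
    pass


@register(r"^gemini", priority=10)
class CustomGeminiProvider:
    pass


@register(r"^my-model", priority=0)
class MyModelProvider:
    pass


print(resolve("gemini-2.5-flash").__name__)
print(resolve("my-model-v1").__name__)
```

Because the higher-priority registration wins, a separately packaged provider can override a built-in one for the same model ID pattern without touching the core code.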

Using OpenAI Models

LangExtract supports OpenAI models (requires the optional dependency: pip install langextract[openai]):

import os

import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",  # Automatically selects the OpenAI provider
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,
    use_schema_constraints=False
)

Note: OpenAI models require fence_output=True and use_schema_constraints=False because LangExtract does not yet implement schema constraints for OpenAI.

Using Local LLMs with Ollama

LangExtract supports local inference using Ollama, allowing you to run models without API keys:

import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",  # Automatically selects the Ollama provider
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False
)

Quick setup: Install Ollama from ollama.com, run ollama pull gemma2:2b, then ollama serve. For detailed installation, Docker setup, and examples, see examples/ollama/.
