What Is LangExtract? A Python Library for Extracting Structured Information with LLMs
LangExtract is a Python library that leverages large language models (LLMs) to extract structured information from unstructured text documents, featuring precise source mapping, customizable extraction schemas, and support for multiple model providers.
Overview
What is LangExtract?
LangExtract is a Python library that leverages Large Language Models (LLMs) to extract structured information from unstructured text documents. It can process materials such as clinical notes, literary texts, or reports, identifying and organizing key details while maintaining a precise mapping between the extracted data and its location within the source text.
Core Capabilities
LangExtract provides a powerful set of features designed to simplify the process of extracting structured data from complex text.
- Source Annotation: Maps each extraction to the precise character position within the source text.
- Structured Output: Defines output schemas based on a few examples.
- Long Document Handling: Processes large volumes of text through chunking and parallel processing.
- Interactive Visualization: Generates HTML files for reviewing extractions in context.
- Multi-Vendor Support: Works with cloud LLMs (Gemini, OpenAI) and local models (Ollama).
- Domain Adaptability: Can be configured for any extraction task using examples.
Hands-On Guide
Installation
LangExtract can be installed from PyPI or built from source. The library requires Python 3.10 or higher and offers optional dependencies for specific providers.
Standard Installation

```bash
pip install langextract
```
Development Installation

```bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
```
Basic Workflow
The core workflow of LangExtract involves three main steps: defining the task, processing the text, and visualizing the results.
- Define Input Samples: Create a prompt (`prompt_description`) and examples (`examples`) to guide the model.
- Call the Extract Function: Process the input text using `lx.extract()`. Internally, this covers:
  - Input Processing: If `fetch_urls=True` and the input is a URL, the text is downloaded automatically.
  - Prompt Template Creation: Uses `PromptTemplateStructured` to organize the prompt and examples.
  - Model Configuration: Creates a language model based on parameter priority (`model` > `config` > `model_id`).
  - Text Processing: Coordinates text chunking, parallel processing, and result parsing via the `Annotator`.
  - Result Alignment: Aligns extraction results to source text positions using the `Resolver`.
- Visualize Results: Save the results and generate an interactive HTML visualization (a minimal sketch of the full workflow follows this list).
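The following is a minimal sketch of this three-step workflow, reusing the `lx.data` objects, `lx.extract()` call, and `gemini-2.5-flash` model ID shown later in this article; the Shakespeare snippet and output filename are illustrative, and a configured Gemini API key is assumed.

```python
import langextract as lx

# 1. Define the task: a prompt plus at least one worked example.
prompt_description = "Extract character names in their order of appearance."
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO"),
        ],
    )
]

# 2. Process the text.
result = lx.extract(
    text_or_documents="JULIET. O Romeo, Romeo! wherefore art thou Romeo?",
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
)

# 3. Save the results and generate the interactive HTML visualization.
lx.io.save_annotated_documents([result], output_name="demo.jsonl", output_dir=".")
html = lx.visualize("demo.jsonl")
```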
The returned `AnnotatedDocument` contains:
- The original text and a `document_id`.
- A list of `Extraction` objects, each containing `char_interval` positional information.
- An `AlignmentStatus` for each extraction indicating match quality.
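Continuing the sketch above, these fields can be inspected directly; note that the `alignment_status` attribute name on each `Extraction` is an assumption inferred from the `AlignmentStatus` type mentioned above.

```python
# Inspect the AnnotatedDocument returned by lx.extract() (continuing the sketch above).
print(result.document_id)
for extraction in result.extractions:
    interval = extraction.char_interval
    span = (interval.start_pos, interval.end_pos) if interval else None
    # alignment_status indicates how well this extraction matched the source text
    print(extraction.extraction_class, extraction.extraction_text, span,
          extraction.alignment_status)
```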
Note: For long documents, using URLs directly with parallel processing and multiple extraction passes can improve performance and accuracy. The system supports multiple model providers (Gemini, OpenAI, Ollama, etc.) and automatically selects the appropriate one via a factory pattern.
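As a hedged sketch of that long-document path: the public-domain URL below is illustrative, and the `extraction_passes` / `max_workers` parameter names are assumptions based on the chunking, parallel-processing, and multi-pass behavior described above.

```python
# Long-document sketch: pass a URL directly and enable parallel chunk
# processing plus multiple extraction passes (parameter names assumed).
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",  # illustrative URL
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
    fetch_urls=True,          # download the text automatically
    extraction_passes=3,      # multiple passes can improve recall on long texts
    max_workers=10,           # process chunks in parallel
)
```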
Qwen Model Example
The following code demonstrates how to use LangExtract with the Qwen model to extract structured medication information from medical text.
```python
import langextract as lx
from langextract.providers.openai import OpenAILanguageModel

# Text with a medication mention
input_text = "Patient took 400 mg PO Ibuprofen q4h for two days."

# Define extraction prompt
prompt_description = (
    "Extract medication information including medication name, dosage, route, "
    "frequency, and duration in the order they appear in the text."
)

# Define example data with entities in order of appearance
examples = [
    lx.data.ExampleData(
        text="Patient was given 250 mg IV Cefazolin TID for one week.",
        extractions=[
            lx.data.Extraction(extraction_class="dosage", extraction_text="250 mg"),
            lx.data.Extraction(extraction_class="route", extraction_text="IV"),
            lx.data.Extraction(extraction_class="medication", extraction_text="Cefazolin"),
            lx.data.Extraction(extraction_class="frequency", extraction_text="TID"),  # TID = three times a day
            lx.data.Extraction(extraction_class="duration", extraction_text="for one week"),
        ],
    )
]

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    fence_output=True,
    use_schema_constraints=False,
    model=OpenAILanguageModel(
        model_id='qwen-plus',
        base_url='',  # Your API base URL
        api_key='',   # Your API key
        provider_kwargs={
            'connect_timeout': 60,  # Allow 60 seconds for SSL handshake
            'timeout': 120,         # Keep 120 seconds overall request timeout
        },
    ),
)

# Display entities with positions
print(f"Input: {input_text}\n")
print("Extracted entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class.capitalize()}: {entity.extraction_text}{position_info}")

# Save and visualize the results
lx.io.save_annotated_documents([result], output_name="medical_ner_extraction.jsonl", output_dir=".")

# Generate the interactive visualization
html_content = lx.visualize("medical_ner_extraction.jsonl")
with open("medical_ner_visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)
print("Interactive visualization saved to medical_ner_visualization.html")
```
The core objective of this code is to use the langextract library with a large language model (Qwen) to automatically extract structured medication information (dosage, route, name, etc.) from medical text, and to present the results via console output, file saving, and an HTML visualization. It is suitable for scenarios such as medical text analysis and medication information extraction.
Example of the generated HTML visualization interface:
[Figure: screenshot of the interactive HTML visualization]
Fixing Chinese Character Encoding Issues: Initially, the generated HTML might display garbled characters for Chinese text. To fix this, ensure the HTML file uses UTF-8 encoding.
```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Medical Entity Extraction Visualization</title>
  <style>
    /* Keep original CSS styles unchanged */
  </style>
</head>
<body>
  <!-- Keep original HTML content unchanged -->
  <div class="lx-animated-wrapper lx-gif-optimized">
    <!-- ... original content ... -->
  </div>
  <script>
    // Keep original JavaScript code unchanged
  </script>
</body>
</html>
```
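Alternatively, the problem can be avoided at the source by writing the file with an explicit UTF-8 encoding, so the bytes on disk match the `<meta charset="UTF-8">` declaration. This is a minimal variation of the save step shown earlier:

```python
# Write the visualization with an explicit UTF-8 encoding so Chinese text
# renders correctly without editing the generated HTML afterwards.
html_content = lx.visualize("medical_ner_extraction.jsonl")
with open("medical_ner_visualization.html", "w", encoding="utf-8") as f:
    f.write(html_content.data if hasattr(html_content, "data") else html_content)
```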
Furthermore, I discovered that the tokenizer did not initially support Chinese characters!
Adding Chinese Tokenizer Support: Modify the tokenizer patterns to include Chinese character ranges.
```python
import re

# ✅ Updated to support Chinese characters (CJK Unified Ideographs, Extension A,
# Compatibility Ideographs) and other Unicode languages

_LETTERS_PATTERN = r"[A-Za-z\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
"""Matches consecutive letters in Chinese and English (includes CJK Basic, Extension A, and Compatibility blocks)."""

_DIGITS_PATTERN = r"[0-9\uff10-\uff19]+"
"""Matches Arabic numerals and full-width numerals."""

_SYMBOLS_PATTERN = r"[^A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\s]+"
"""Matches symbols other than Chinese, English, numerals, and whitespace (includes full-width symbols)."""

_END_OF_SENTENCE_PATTERN = re.compile(r"[.?!。?!]$")
"""Matches end-of-sentence punctuation (includes Chinese and English punctuation)."""

_SLASH_ABBREV_PATTERN = (
    r"[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
    r"(?:/[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)+"
)
"""Matches slash abbreviations or compound words like '中/英/混合'."""

_TOKEN_PATTERN = re.compile(
    rf"{_SLASH_ABBREV_PATTERN}|{_LETTERS_PATTERN}|{_DIGITS_PATTERN}|{_SYMBOLS_PATTERN}"
)
"""Universal token-matching pattern: supports Chinese, English, numerals, and symbols."""

_WORD_PATTERN = re.compile(rf"(?:{_LETTERS_PATTERN}|{_DIGITS_PATTERN})\Z")
"""Matches complete words (ending with letters or numerals)."""
```
With the modifications above, Chinese language support is enabled.
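A quick, illustrative sanity check of the updated pattern (run in the same module scope as the definitions above; the sample string reuses the '中/英/混合' example from the docstring):

```python
# Tokenize a mixed Chinese/English string with the updated pattern.
print(_TOKEN_PATTERN.findall("中/英/混合 text 123!"))
# Expected output: ['中/英/混合', 'text', '123', '!']
```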
PDF Document Support
Currently, LangExtract only supports processing raw text strings. In real-world workflows, source files are often in PDF, DOCX, or PPTX formats. Users currently must:
- Manually convert files to text (losing layout and provenance).
- Input the plain text into LangExtract.
- Manually map extractions back to the original document for verification.

A single-step workflow would make adopting LangExtract much easier.
Proposed Solution: Integrate the Docling library as an optional front-end.
- Docling can convert various document formats into a unified `DoclingDocument`.
- It preserves provenance (page, bounding box, reading order).
- The extracted text chunks are fed into LangExtract as is done today.
- Extraction results are mapped back to the original document via provenance metadata.
- The integration would be optional (`pip install langextract[docling]`), keeping the core package free of extra dependencies (see the conversion sketch below).
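For reference, the conversion half of this proposal already exists in Docling today. The following is a hedged sketch of the glue code: the `DocumentConverter` usage follows Docling's documented quickstart, the filename is illustrative, the prompt and examples are assumed defined as earlier in the article, and the plain-text hand-off deliberately shows what is lost (the provenance mapping the proposal aims to keep).

```python
import langextract as lx
from docling.document_converter import DocumentConverter

# Convert a PDF into a DoclingDocument (provenance is kept inside `doc`).
converter = DocumentConverter()
doc = converter.convert("report.pdf").document

# Feed the plain-text export into LangExtract as is done today; mapping
# results back to pages/bounding boxes is what the proposed integration
# would add on top of this.
result = lx.extract(
    text_or_documents=doc.export_to_markdown(),
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```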
Proof of Concept: The following is a conceptual example and is not yet integrated into the main codebase.
```python
import langextract as lx
import textwrap
from pdf_extract import extract_with_file_support  # Hypothetical function

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"},
            ),
        ],
    )
]

source = "<sample pdf file>.pdf"
result = extract_with_file_support(
    source=source,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

# result.extractions[0].extraction_text
# result.extractions[0].provenance  # Would contain page, bounding box info
```
Summary: Currently, LangExtract feels like a well-encapsulated integration layer over various large models, with clean input and output handling; it does not yet natively support PDF processing.
Analysis and Comparison
The following table compares LangExtract with a hypothetical Qwen-driven extraction system across several key dimensions.
| Dimension | LangExtract Strengths/Weaknesses | Potential Qwen-Driven System Strengths/Weaknesses | Overall Assessment / Recommendation |
|---|---|---|---|
| PDF / complex document support | Weakness: cannot directly ingest PDF/DOC formats; users must convert to text first. | If well engineered, could support direct input from PDF/DOC/OCR. | If your workflow involves many PDF/Word documents, a Qwen system that handles this step would be a significant advantage. |
| Chinese / multilingual support | Weakness: known issues with non-Latin characters; alignment and character positions may be inaccurate. | Likely stronger, especially in Chinese-dominated contexts. | For Chinese-dominated scenarios, a Qwen system may be better, but adaptation and testing are still required. |
| Scope / task-domain adaptability | Strength: adapts to domain extraction tasks with only a few examples. | | |