What Is LangExtract? A Python Library That Uses Large Language Models to Extract Structured Information

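Overview

What is LangExtract?

LangExtract is a Python library that uses large language models (LLMs) to extract structured information from unstructured text documents. It can process material such as clinical notes, literary texts, or reports, identifying and organizing key details while keeping an exact mapping between each extracted item and its position in the source text.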

Core Capabilities

LangExtract provides a set of features designed to simplify the extraction of structured data from complex text.

Source Annotation: maps each extraction to its exact character position in the source text.
Structured Output: enforces a consistent output schema based on a few examples.
Long Document Handling: processes large texts through chunking and parallel processing.
Interactive Visualization: generates an HTML file for reviewing extractions in context.
Multi-Provider Support: works with cloud LLMs (Gemini, OpenAI) and local models (Ollama).
Domain Adaptability: can be configured for extraction tasks in any domain using examples.

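Practical Guide

Installation

LangExtract can be installed from PyPI or built from source. The library requires Python 3.10 or later and offers optional dependencies for specific providers. The first command below installs from PyPI; the remaining commands clone the repository and install it in editable mode with the development dependencies.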

pip install langextract

git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
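
As a quick sanity check after installing, something like the following could confirm that the interpreter meets the Python 3.10 requirement stated above and that the package can be found. The helper names are illustrative and are not part of LangExtract.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_is_installed(name: str = "langextract") -> bool:
    """Return True if the named top-level package is on the import path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("langextract installed:", package_is_installed())
```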

Basic Workflow

The core LangExtract workflow consists of three main steps: defining the task, processing the text, and visualizing the results.

1. Define Input Samples: create a prompt (prompt_description) and examples (examples) to guide the model.
2. Call the Extract Function: process the input text with lx.extract(), which performs the following steps:
* Input Processing: if fetch_urls=True and the input is a URL, the text is downloaded automatically. PromptTemplateStructured then organizes the prompt and examples.
* Model Configuration: creates the language model according to parameter priority (model > config > model_id).
* Text Processing: the Annotator coordinates text chunking, parallel processing, and result parsing.
* Result Alignment: the Resolver aligns extraction results to positions in the source text.
3. Visualize Results: save the results and generate an interactive HTML visualization.
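
The parameter priority in the Model Configuration step can be illustrated with a small stand-alone sketch. This is not LangExtract's actual code: ModelConfig, resolve_model, and the string stand-ins for model objects are hypothetical, and only the ordering model > config > model_id is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    """Hypothetical stand-in for a provider configuration object."""
    model_id: str


def resolve_model(
    model: Optional[str] = None,
    config: Optional[ModelConfig] = None,
    model_id: Optional[str] = None,
) -> str:
    """Pick a model source using the priority model > config > model_id."""
    if model is not None:
        return f"instance:{model}"  # an explicit model object wins
    if config is not None:
        return f"config:{config.model_id}"  # otherwise build from the config
    if model_id is not None:
        return f"id:{model_id}"  # last resort: look up by model ID
    raise ValueError("one of model, config, or model_id is required")
```

With all three arguments supplied, the sketch ignores config and model_id, which mirrors why passing an explicit model object (as the Qwen example in this article does) overrides any model ID.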

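The returned AnnotatedDocument contains:

The original text and a document_id.
A list of Extraction objects, each carrying char_interval position information.
An AlignmentStatus for each extraction, indicating the match quality.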
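
To make the position mapping concrete, the sketch below uses simplified stand-ins for the result objects (not the library's real classes) and checks that slicing the source text with an interval returns the extracted text. The field names start_pos and end_pos follow the usage elsewhere in this article; the locate helper and the assumption that end_pos is exclusive are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharInterval:
    start_pos: int
    end_pos: int  # assumed exclusive in this sketch


@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str
    char_interval: Optional[CharInterval] = None


def locate(source: str, extraction: Extraction) -> Extraction:
    """Attach a char_interval by exact substring search (first occurrence)."""
    start = source.find(extraction.extraction_text)
    if start != -1:
        end = start + len(extraction.extraction_text)
        extraction.char_interval = CharInterval(start, end)
    return extraction


source = "Patient took 400 mg PO Ibuprofen q4h for two days."
found = locate(source, Extraction("medication", "Ibuprofen"))
missing = locate(source, Extraction("medication", "Aspirin"))

interval = found.char_interval
print(source[interval.start_pos:interval.end_pos])  # the grounded span
print(missing.char_interval)  # None: text not present, so no position
```

This sketch only handles exact matches. The AlignmentStatus described above exists precisely because model output does not always match the source verbatim.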

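Note: for long documents, passing a URL directly and enabling parallel processing and multiple extraction passes can improve performance and accuracy. The system supports multiple model providers (Gemini, OpenAI, Ollama, and others) and selects the appropriate provider automatically through a factory pattern.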
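
The chunking and parallel processing mentioned above can be pictured with a minimal standard-library sketch. It is an illustration under stated assumptions, not how LangExtract implements it: chunk_text, fake_extract, and the fixed-size character chunks are hypothetical, and fake_extract stands in for a model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def chunk_text(text: str, max_chars: int) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pairs of at most max_chars characters."""
    return [(i, text[i:i + max_chars]) for i in range(0, len(text), max_chars)]


def fake_extract(item: Tuple[int, str]) -> List[Tuple[str, int, int]]:
    """Stand-in for an LLM call: find 'mg' and report document-level offsets."""
    offset, chunk = item
    hits = []
    start = chunk.find("mg")
    while start != -1:
        hits.append(("mg", offset + start, offset + start + 2))
        start = chunk.find("mg", start + 1)
    return hits


def extract_parallel(text: str, max_chars: int = 20, max_workers: int = 4):
    """Process chunks concurrently and merge results in document order."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(fake_extract, chunks))  # map keeps input order
    return [hit for hits in per_chunk for hit in hits]


text = "Take 400 mg now, then 200 mg after eight hours."
results = extract_parallel(text)
print(results)
```

Carrying each chunk's offset is what lets per-chunk results be reported as positions in the whole document. Fixed-size chunks can split an entity across a boundary, so a real system would more likely cut at sentence or paragraph boundaries.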

Example: Calling a Qwen Model

The following code shows how to use LangExtract with a Qwen model to extract structured medication information from medical text.

import langextract as lx
from langextract.providers.openai import OpenAILanguageModel

# Text with a medication mention
input_text = "Patient took 400 mg PO Ibuprofen q4h for two days."

# Define extraction prompt
prompt_description = "Extract medication information including medication name, dosage, route, frequency, and duration in the order they appear in the text."

# Define example data with entities in order of appearance
examples = [
    lx.data.ExampleData(
        text="Patient was given 250 mg IV Cefazolin TID for one week.",
        extractions=[
            lx.data.Extraction(extraction_class="dosage", extraction_text="250 mg"),
            lx.data.Extraction(extraction_class="route", extraction_text="IV"),
            lx.data.Extraction(extraction_class="medication", extraction_text="Cefazolin"),
            lx.data.Extraction(extraction_class="frequency", extraction_text="TID"),  # TID = three times a day
            lx.data.Extraction(extraction_class="duration", extraction_text="for one week")
        ]
    )
]

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    fence_output=True,
    use_schema_constraints=False,
    model=OpenAILanguageModel(
        model_id='qwen-plus',
        base_url='', # Your API base URL
        api_key='', # Your API key
        provider_kwargs={
            'connect_timeout': 60,  # Allow 60 seconds for SSL handshake
            'timeout': 120          # Keep 120 seconds overall request timeout
        }
    )
)

# Display entities with positions
print(f"Input: {input_text}\n")
print("Extracted entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class.capitalize()}: {entity.extraction_text}{position_info}")

# Save and visualize the results
lx.io.save_annotated_documents([result], output_name="medical_ner_extraction.jsonl", output_dir=".")

# Generate the interactive visualization
html_content = lx.visualize("medical_ner_extraction.jsonl")
with open("medical_ner_visualization.html", "w", encoding="utf-8") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)

print("Interactive visualization saved to medical_ner_visualization.html")

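The goal of this code is to connect the langextract library to a large language model (Qwen), automatically extract structured medication information (dosage, route, name, and so on) from medical text, and present the results through printed output, a saved file, and an HTML visualization. It suits scenarios such as medical text analysis and medication information extraction.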

[Figure: example of the generated HTML visualization interface.]

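Fixing garbled Chinese text: the generated HTML may initially show garbled characters for Chinese text. To fix this, make sure the HTML file uses UTF-8 encoding, for example by declaring the charset in a meta tag as in the following template.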

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>医疗实体提取可视化</title>
    <style>
        <!-- Keep original CSS styles unchanged -->
    </style>
</head>
<body>
    <!-- Keep original HTML content unchanged -->
    <div class="lx-animated-wrapper lx-gif-optimized">
        <!-- ... original content ... -->
    </div>

    <script>
        <!-- Keep original JavaScript code unchanged -->
    </script>
</body>
</html>
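
One way to apply this fix programmatically is sketched below: insert a charset declaration if the generated markup lacks one, and write the file with an explicit UTF-8 encoding. The function names are hypothetical helpers, not part of LangExtract, and the sketch assumes the visualization is available as a plain HTML string.

```python
import re
from pathlib import Path

_META = '<meta charset="UTF-8">'


def ensure_utf8_meta(html: str) -> str:
    """Return html with a charset declaration, adding a UTF-8 one if missing."""
    if re.search(r"<meta[^>]*charset", html, flags=re.IGNORECASE):
        return html  # already declares a charset; leave it untouched
    head = re.search(r"<head(\s[^>]*)?>", html, flags=re.IGNORECASE)
    if head:
        # Insert right after the opening <head> tag.
        return html[:head.end()] + _META + html[head.end():]
    # No <head> at all (an HTML fragment): prepend the declaration.
    return _META + html


def save_html_utf8(html: str, path: str) -> None:
    """Write the HTML as UTF-8 regardless of the platform default encoding."""
    Path(path).write_text(ensure_utf8_meta(html), encoding="utf-8")
```

Passing encoding="utf-8" explicitly matters on systems whose default text encoding is not UTF-8, where a plain open(path, "w") can produce the garbling described above.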

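I also found that the tokenizer did not originally support Chinese.

Adding Chinese tokenization support: modify the tokenizer patterns to include the Chinese character ranges.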

# Updated to support Chinese characters (CJK Unified Ideographs, Extension A,
# Compatibility Ideographs) and full-width digits.
import re

_LETTERS_PATTERN = (
    r"[A-Za-z\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
)
"""Matches consecutive letters in Chinese and English (includes CJK Basic, Extension A, and Compatibility blocks)"""

_DIGITS_PATTERN = (
    r"[0-9\uff10-\uff19]+"
)
"""Matches Arabic numerals and full-width numerals"""

_SYMBOLS_PATTERN = (
    r"[^A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\s]+"
)
"""Matches symbols other than Chinese, English, numerals, and whitespace (includes full-width symbols)"""

_END_OF_SENTENCE_PATTERN = re.compile(r"[.?!。？！]$")
"""Matches end-of-sentence punctuation (includes Chinese and English punctuation)"""

_SLASH_ABBREV_PATTERN = (
    r"[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
    r"(?:/[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)+"
)
"""Matches slash abbreviations or compound words like '中/英/混合'"""

_TOKEN_PATTERN = re.compile(
    rf"{_SLASH_ABBREV_PATTERN}|{_LETTERS_PATTERN}|{_DIGITS_PATTERN}|{_SYMBOLS_PATTERN}"
)
"""Universal token matching pattern: supports Chinese, English, numerals, symbols"""

_WORD_PATTERN = re.compile(
    rf"(?:{_LETTERS_PATTERN}|{_DIGITS_PATTERN})\Z"
)
"""Matches complete words (ending with letters or numerals)"""

With the changes above, Chinese text is supported.
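
A quick way to see what the modified patterns do is to run them on mixed Chinese and English text. The sketch below rebuilds the token pattern in a stand-alone script rather than importing LangExtract; the shared _CJK constant and the tokenize helper are conveniences introduced here, not names from the library.

```python
import re

# CJK ranges copied from the patterns above.
_CJK = r"\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff"

_LETTERS = rf"[A-Za-z{_CJK}]+"
_DIGITS = r"[0-9\uff10-\uff19]+"
_SYMBOLS = rf"[^A-Za-z0-9{_CJK}\s]+"
_SLASH_ABBREV = rf"[A-Za-z0-9{_CJK}]+(?:/[A-Za-z0-9{_CJK}]+)+"

_TOKEN = re.compile(rf"{_SLASH_ABBREV}|{_LETTERS}|{_DIGITS}|{_SYMBOLS}")


def tokenize(text: str) -> list:
    """Return the tokens matched by the combined pattern, in order."""
    return _TOKEN.findall(text)


print(tokenize("患者服用Ibuprofen 400mg。"))
# ['患者服用Ibuprofen', '400', 'mg', '。']
print(tokenize("中/英/混合 text"))
# ['中/英/混合', 'text']
```

Note that a run of Chinese characters, even when directly followed by Latin letters, comes out as a single token. These patterns make Chinese text tokenizable but do not perform word segmentation.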

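PDF Document Support

LangExtract currently only processes raw text strings. In real workflows, source files are usually in PDF, DOCX, or PPTX format. Users currently have to:

Convert the file to text manually (losing layout and provenance).
Feed the plain text into LangExtract.
Map the extractions back to the original document manually for verification.

A single-step flow would make LangExtract easier to adopt.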

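Proposed solution: integrate the Docling library as an optional front end.

Docling can convert many document formats into a unified DoclingDocument.
It preserves provenance (page, bounding box, reading order).
The extracted text chunks are fed into LangExtract the same way as today.
Extraction results are mapped back to the original document through the provenance metadata.
The integration would be optional (pip install langextract[docling]), so the core package stays free of additional dependencies.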
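
The last two steps of the proposal, feeding text chunks in and mapping results back, could work roughly as sketched below. Nothing here uses Docling's API: Block, build_text, and locate_block are hypothetical names, and the page and bbox values are made-up sample data standing in for whatever provenance a converter would supply.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical text block with provenance from a document converter."""
    text: str
    page: int
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom)


def build_text(blocks: List[Block], sep: str = "\n") -> Tuple[str, List[int]]:
    """Join blocks into one string and record each block's start offset."""
    starts, parts, pos = [], [], 0
    for block in blocks:
        starts.append(pos)
        parts.append(block.text)
        pos += len(block.text) + len(sep)
    return sep.join(parts), starts


def locate_block(blocks: List[Block], starts: List[int], char_pos: int) -> Block:
    """Find the block whose text span contains the given character offset."""
    index = bisect_right(starts, char_pos) - 1
    return blocks[index]


blocks = [
    Block("ROMEO. But soft!", page=1, bbox=(72.0, 90.0, 300.0, 110.0)),
    Block("Juliet is the sun.", page=2, bbox=(72.0, 90.0, 320.0, 110.0)),
]
text, starts = build_text(blocks)
hit = text.find("Juliet")  # stands in for an extraction's start position
source_block = locate_block(blocks, starts, hit)
print(source_block.page, source_block.bbox)
```

Keeping a sorted list of block start offsets means an extraction's character position can be traced to a page and bounding box with a binary search, without changing how the text itself is passed to the extractor.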

Proof of concept: the following is a conceptual example that has not been integrated into the main codebase.

import langextract as lx
import textwrap
from pdf_extract import extract_with_file_support # Hypothetical function

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

source = "<sample pdf file>.pdf"
result = extract_with_file_support(
    source=source,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

# result.extractions[0].extraction_text
# result.extractions[0].provenance # Would contain page, bounding box info

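Summary: LangExtract currently feels like a wrapper that integrates various large models and packages the input and output handling. It also does not yet support PDF natively.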

Analysis and Comparison

The table below compares LangExtract with a hypothetical Qwen-driven extraction system along several key dimensions.


Dimension	LangExtract Strengths/Weaknesses	Potential Strengths/Weaknesses of a Qwen-Driven System	Overall Assessment / Recommendation
PDF / Complex Document Support	Weakness: cannot directly handle PDF, DOC, and similar formats; users must convert to text first.	If well engineered, could support direct input from PDF, DOC, or OCR.	If your workflow involves many PDF or Word documents, a Qwen-based system that supports this step would be a significant advantage.
Chinese / Multilingual Support	Weakness: known issues with non-Latin characters; alignment and character positions may be wrong.	The advantage may be more pronounced, especially in predominantly Chinese contexts.	For predominantly Chinese scenarios, a Qwen-based system may be better, but adaptation and testing are still required.
Scope / Task Domain Adaptability	Strength: adapts to domain extraction tasks with only a few examples
