如何使用LangExtract构建知识图谱?2025年Google开源工具实战指南
摘要
LangExtract 是 Google 于 2025 年 7 月 30 日开源的一款“程序化抽取”工具。它专门面向邮件、报告、病历等非结构化文本,能够精确抽取所需信息,并将每个抽取结果与原文字符偏移量(offset)绑定,从而实现可追溯、可高亮验证的结构化输出。其核心能力包括:对长文档进行分块与并行处理、通过多轮抽取确保召回率、直接生成结构化结果以减少传统 RAG 工作流中的切分与向量嵌入开销。该工具可同时兼容云端大模型(如 Gemini)与本地开源模型,并支持自定义提示模板以适配不同领域。本文提供了一个以 Streamlit 为界面、Agraph 为可视化组件、LangExtract 为抽取核心的“知识图谱 + 聊天机器人”最小可行示例。该示例展示了如何根据关键词动态选择 few-shot 模板,并行抽取实体与关系,构建知识图谱节点与边,并在未检测到显式关系时,以“related_to”边作为回退策略保持图的连通性。最后,系统支持查询过滤,并在多标签页中展示图谱、实体、关系与查询结果。文章最后提醒:单一工具无法解决所有问题,需要组合使用多种工具,并通过持续的反馈迭代来提升系统质量。
LangExtract is an open-source “programmatic extraction” tool released by Google on July 30, 2025. It is designed for unstructured texts such as emails, reports, and medical records, enabling precise extraction of required information and binding each extraction result to its original character offset. This facilitates traceable, highlight-verifiable structured output. Its core capabilities include: chunking and parallel processing of long documents, multi-round extraction to ensure recall, and direct generation of structured results to reduce the segmentation and vector embedding overhead in traditional RAG workflows. The tool is compatible with both cloud-based large models (e.g., Gemini) and local open-source models, and supports custom prompt templates for different domains. This article provides a minimal viable demonstration of a “Knowledge Graph + Chatbot” system using Streamlit for the interface, Agraph for visualization, and LangExtract as the extraction core. The demo illustrates dynamic few-shot template selection based on keywords, parallel extraction of entities and relationships, construction of graph nodes and edges, and a fallback strategy using “related_to” edges to maintain graph connectivity when no explicit relationships are detected. Finally, the system supports query filtering and displays the graph, entities, relationships, and query results across multiple tabs. The article concludes with a reminder that no single tool can solve all problems; a combination of tools and continuous feedback iteration is necessary to improve quality.
1 主题导入
在这个快速演示中,我将展示如何用 LangExtract 构建一个知识图谱,并将其与聊天机器人结合,为企业或个人打造实用的问答系统。在数据驱动的今天,许多有价值的信息潜藏在非结构化文本中(如临床记录、冗长的法律合同、用户反馈话题)。从这些文档中抽取“有意义且可追溯”的信息一直是技术与实践上的双重挑战。
In this quick demonstration, I will show how to use LangExtract to build a knowledge graph and integrate it with a chatbot to create a practical Q&A system for enterprises or individuals. In today's data-driven world, valuable information often lies hidden within unstructured texts (e.g., clinical notes, lengthy legal contracts, user feedback topics). Extracting “meaningful and traceable” information from these documents has always been a dual challenge in both technology and practice.
2025 年 7 月 30 日,Google 发布开源 AI 项目 LangExtract。该工具能从我们每天会读到的文本(邮件、报告、病历等)中“只抽取必要信息”,并组织成计算机易处理的格式。尽管 AI 很有用,但也有短板:可能产生幻觉、提供错误信息、一次可保留的上下文有限、同样问题可能每次回答不同。LangExtract 充当一座“智能桥梁”,弥补上述弱点,把“理解文本”的能力转化为“抽取可靠信息”的能力。接下来给出一个在线聊天机器人的快速演示来说明这一点。
On July 30, 2025, Google released the open-source AI project LangExtract. This tool can “extract only the necessary information” from the texts we read daily (emails, reports, medical records, etc.) and organize it into a computer-friendly format. While AI is useful, it also has shortcomings: it may hallucinate, provide incorrect information, have a limited context window, and give different answers to the same question. LangExtract acts as an “intelligent bridge” to mitigate these weaknesses, transforming the ability to “understand text” into the ability to “extract reliable information.” A quick demo of an online chatbot will illustrate this point.
示例输入:“Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until he died in 2011.”在该系统的输出生成过程中,智能体会用 document_extractor_tool 抽取实体:它调用 LangExtract,并基于动态 few-shot 示例按查询关键词自动选择合适的抽取模板——例如检测到 “financial”“revenue”“company” 等关键词时将应用面向业务的示例,将公司名、人物、地点与日期等正确归类而非落入通用类别。实体抽取与关系抽取并行运行:系统通过文档上下文识别“创立于”“总部位于”“与…竞争”等关系。两者完成后,build_graph_data 会创建图结构:为每个唯一实体创建节点、为每个发现的关系创建边;若未检测到显式关系,则通过“related_to”作为稳健回退确保连通。最终用 Streamlit Agraph(Streamlit 的自定义组件,用于交互式图可视化,支持设置图的宽高、布局、物理引擎、层级结构等参数)渲染交互式知识图谱,用户可探索公司、创始人、地点等之间的连接,系统在内存中运行、无需文件操作,并提供实时调试信息(实体与关系数量),支持针对科技公司与其相互关系的查询与结果过滤。
Example input: “Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until he died in 2011.” During the system's output generation, the agent uses the document_extractor_tool to extract entities: it calls LangExtract and automatically selects the appropriate extraction template from dynamic few-shot examples according to query keywords. For instance, upon detecting keywords like “financial,” “revenue,” or “company,” it applies business-oriented examples to correctly categorize company names, people, locations, and dates instead of falling back to generic categories. Entity extraction and relationship extraction run in parallel: the system identifies relationships like “founded by,” “headquartered in,” and “competes with” through document context. After both are complete, build_graph_data creates the graph structure: a node for each unique entity and an edge for each discovered relationship. If no explicit relationship is detected, “related_to” serves as a robust fallback to ensure connectivity. Finally, the interactive knowledge graph is rendered using Streamlit Agraph (a custom Streamlit component for interactive graph visualization that supports configuring the graph's width, height, layout, physics engine, and hierarchy). Users can explore connections between companies, founders, locations, etc. The system runs in memory without file operations, provides real-time debugging information (counts of entities and relationships), and supports queries and result filtering for tech companies and their interrelationships.
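原文未给出 build_graph_data 的实现。下面按上文描述给出一个最小示意:假设实体是带 text 字段的字典、关系是带 source/target/type 字段的字典(字段命名均为假设,而非原项目定义):
The original does not show the implementation of build_graph_data. Below is a minimal sketch based on the description above; it assumes entities are dicts with a text field and relationships are dicts with source/target/type fields (these field names are assumptions, not the original project's definitions):
# 功能:由抽取结果构建图的节点与边(示意实现;输入字段结构为假设)
from streamlit_agraph import Edge, Node

def build_graph_data(entities: list, relationships: list):
    # 为每个唯一实体创建一个节点
    unique = {e["text"]: e for e in entities}
    nodes = [Node(id=name, label=name, size=25) for name in unique]
    if relationships:
        # 为每个发现的关系创建一条边
        edges = [Edge(source=r["source"], target=r["target"], label=r["type"]) for r in relationships]
    else:
        # 回退策略:未检测到显式关系时,用 related_to 边串联实体,保持图连通
        names = list(unique)
        edges = [Edge(source=names[i], target=names[i + 1], label="related_to") for i in range(len(names) - 1)]
    return nodes, edges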
2 什么是 LangExtract?
LangExtract 是 Google 最新开源、公开可用的抽取特性,有望为开发者与数据团队“重建理性秩序”。它不只是“用 AI 抽信息”,而是将每次抽取与原文绑定。LangExtract 作为构建在 LLM 之上的“特殊机制”,针对信息抽取中的常见挑战(幻觉、不精确、有限上下文窗口、非确定性)最大化 LLM 的可用性。
LangExtract is Google's latest open-source, publicly available extraction feature, promising to “restore rational order” for developers and data teams. It is not just about “using AI to extract information” but about binding each extraction to the original text. As a “special mechanism” built on top of LLMs, LangExtract maximizes the utility of LLMs by addressing common challenges in information extraction (hallucination, imprecision, limited context windows, non-determinism).
2.1 LangExtract 有何特别之处?
LangExtract 的核心优势是“程序化抽取”:不仅精准识别所需信息,还能把每条抽取结果链接到原文的确切字符位置(offset)。这种可追溯性允许对结果进行高亮与验证,显著提升交互式的数据可靠性。LangExtract 具备一系列强大功能:可通过分块、并行计算与多轮抽取高效处理“百万 token 级”的长文档以保证召回;直接生成结构化输出,从而无需传统 RAG 工作流里的切分与嵌入;同时兼容云端模型(如 Gemini)与本地开源大模型,并支持自定义提示模板,轻松适配不同领域。
The core advantage of LangExtract is “programmatic extraction”: it not only accurately identifies the required information but also links each extraction result to the exact character position (offset) in the original text. This traceability allows for highlighting and verification of results, significantly enhancing interactive data reliability. LangExtract boasts a range of powerful features: it can efficiently process “million-token-level” long documents through chunking, parallel computing, and multi-round extraction to ensure recall; it directly generates structured output, eliminating the need for segmentation and embedding in traditional RAG workflows; it is compatible with both cloud models (e.g., Gemini) and local open-source large models, and supports custom prompt templates for easy adaptation to different domains.
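下面用一个可运行的最小示例演示这种“带偏移的抽取”。示例假设已在环境变量 GOOGLE_API_KEY 中配置好 Gemini 密钥;char_interval 等字段名请以所装 langextract 版本的文档为准:
A runnable minimal sketch of offset-grounded extraction follows. It assumes a Gemini API key in the GOOGLE_API_KEY environment variable; check the char_interval field name against your installed langextract version:
# 功能:最小示例——LangExtract 抽取结果自带原文字符偏移,可追溯、可高亮
import os
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="Acme Corp was founded in 1999.",
        extractions=[
            lx.data.Extraction(extraction_class="company", extraction_text="Acme Corp"),
            lx.data.Extraction(extraction_class="year", extraction_text="1999"),
        ],
    )
]
result = lx.extract(
    text_or_documents="Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976.",
    prompt_description="Extract company names, people, and years from the text.",
    examples=examples,
    api_key=os.getenv("GOOGLE_API_KEY"),
)
for e in result.extractions:
    span = e.char_interval  # 抽取结果在原文中的字符区间,可用于高亮验证
    print(e.extraction_class, e.extraction_text, span.start_pos if span else None, span.end_pos if span else None)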
2.2 开始编码
我们逐步探索如何用 LangExtract 构建知识图谱式的聊天机器人。首先安装所需库:
Let's explore step-by-step how to build a knowledge graph-style chatbot using LangExtract. First, install the required libraries:
# 功能:安装依赖
# 说明:安装 requirements.txt 中列出的依赖
pip install -r requirements.txt
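原文未列出 requirements.txt 的内容。就本示例而言,大致需要以下依赖(清单为推测,版本未锁定):
The original does not list the contents of requirements.txt; for this demo it would plausibly contain (an assumed, unpinned list):
# requirements.txt(示例,内容为推测)
langextract
streamlit
streamlit-agraph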
下一步是常规导入相关库。langextract 是一个 Python 库,基于用户定义的指令,利用 LLM 从非结构化文本中抽取结构化信息;streamlit_agraph 是 Streamlit 的自定义组件,用于交互式图可视化。
The next step is to import the relevant libraries. langextract is a Python library that uses LLMs to extract structured information from unstructured text based on user-defined instructions; streamlit_agraph is a custom Streamlit component for interactive graph visualization.
# 功能:导入依赖库
import os
import textwrap
import langextract as lx
import logging
import streamlit as st
from streamlit_agraph import Config, Edge, Node, agraph
from typing import List, Dict, Any, Optional
import json
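后文代码通过 logging.info 输出调试信息,需要先配置日志级别才会打印。原文未给出日志初始化;下面是一个最小设置(属补充假设):
The code below logs debug information via logging.info, which only prints once a log level is configured. The original omits this setup; a minimal sketch (an addition, not from the original):
# 功能:日志初始化(补充;默认级别下 INFO 日志不会输出)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")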
现在创建 document_extractor_tool。该函数接收两段字符串:unstructured_text 与 user_query,返回一个 Python 字典,便于后续转为 JSON。内部先用 textwrap.dedent(...) 构造干净的提示词,明确模型的角色(抽取专家)、任务(抽取相关信息)与需关注的具体查询。随后准备 few-shot 示例来引导抽取:检测关键词后(财务、法律、社交/餐饮等),自动挑选对应示例以确保模型理解输出结构与抽取方式;若不匹配则使用通用示例。最后调用 lx.extract(...),传入文本、提示、示例与保存在环境变量中的 API Key;记录日志便于调试,并规范化输出为包含 text、class 与 attributes 的字典列表。
Now, create the document_extractor_tool. This function takes two strings, unstructured_text and user_query, and returns a Python dictionary for easy conversion to JSON later. Internally, it first uses textwrap.dedent(...) to construct a clean prompt, clearly defining the model's role (extraction expert), task (extract relevant information), and the specific query to focus on. It then prepares few-shot examples to guide the extraction: after detecting keywords (financial, legal, social/dining, etc.), it automatically selects the corresponding example to ensure the model understands the output structure and extraction method; if no match is found, a generic example is used. Finally, it calls lx.extract(...), passing in the text, prompt, examples, and the API key stored in an environment variable; logs are recorded for debugging, and the output is normalized into a list of dictionaries containing text, class, and attributes.
# 功能:基于用户查询的结构化信息抽取
# 说明:动态 few-shot 模板选择 + LangExtract 抽取 + 统一字典化输出
def document_extractor_tool(unstructured_text: str, user_query: str) -> dict:
    """
    从给定非结构化文本中,依据用户查询抽取结构化信息。
    返回包含抽取结果字典列表的对象,便于 JSON 化与下游处理。
    """
    prompt = textwrap.dedent(f"""
        You are an expert at extracting specific information from documents.
        Based on the user's query, extract the relevant information from the provided text.
        The user's query is: "{user_query}"
        Provide the output in a structured JSON format.
    """)

    # 依据查询关键词动态选择 few-shot 示例
    examples = []
    query_lower = user_query.lower()
    if any(keyword in query_lower for keyword in ["financial", "revenue", "company", "fiscal"]):
        financial_example = lx.data.ExampleData(
            text="In Q1 2023, Innovate Inc. reported a revenue of $15 million.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="company_name",
                    extraction_text="Innovate Inc.",
                    attributes={"name": "Innovate Inc."},
                ),
                lx.data.Extraction(
                    extraction_class="revenue",
                    extraction_text="$15 million",
                    attributes={"value": 15000000, "currency": "USD"},
                ),
                lx.data.Extraction(
                    extraction_class="fiscal_period",
                    extraction_text="Q1 2023",
                    attributes={"period": "Q1 2023"},
                ),
            ],
        )
        examples.append(financial_example)
    elif any(keyword in query_lower for keyword in ["legal", "agreement", "parties", "effective date"]):
        legal_example = lx.data.ExampleData(
            text="This agreement is between John Doe and Jane Smith, effective 2024-01-01.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="John Doe",
                    attributes={"name": "John Doe"},
                ),
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="Jane Smith",
                    attributes={"name": "Jane Smith"},
                ),
                lx.data.Extraction(
                    extraction_class="effective_date",
                    extraction_text="2024-01-01",
                    attributes={"date": "2024-01-01"},
                ),
            ],
        )
        examples.append(legal_example)
    elif any(keyword in query_lower for keyword in ["social", "post", "feedback", "restaurant", "菜式", "評價"]):
        social_media_example = lx.data.ExampleData(
            text="I tried the new 'Taste Lover' restaurant in TST today. The black truffle risotto was amazing, but the Tiramisu was just average.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="restaurant_name",
                    extraction_text="Taste Lover",
                    attributes={"name": "Taste Lover"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="black truffle risotto",
                    attributes={"name": "black truffle risotto", "sentiment": "positive"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="Tiramisu",
                    attributes={"name": "Tiramisu", "sentiment": "neutral"},
                ),
            ],
        )
        examples.append(social_media_example)
    else:
        # 未命中领域关键词时,使用通用示例
        generic_example = lx.data.ExampleData(
            text="Juliet looked at Romeo with a sense of longing.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Juliet", attributes={"name": "Juliet"}
                ),
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Romeo", attributes={"name": "Romeo"}
                ),
                lx.data.Extraction(
                    extraction_class="emotion", extraction_text="longing", attributes={"type": "longing"}
                ),
            ],
        )
        examples.append(generic_example)
    logging.info(f"Selected {len(examples)} few-shot example(s).")

    # 调用 LangExtract 执行抽取(API Key 取自环境变量)
    result = lx.extract(
        text_or_documents=unstructured_text,
        prompt_description=prompt,
        examples=examples,
        api_key=os.getenv("GOOGLE_API_KEY"),
    )
    logging.info(f"Extraction result: {result}")

    # 统一规范化为字典列表,便于 JSON 化与下游处理
    extractions = [
        {"text": e.extraction_text, "class": e.extraction_class, "attributes": e.attributes}
        for e in result.extractions
    ]
    return {"extracted_data": extractions}
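原文未给出界面接线部分。下面是把 document_extractor_tool 接入 Streamlit 并用 Agraph 渲染的一个最小示意(沿用上文 build_graph_data 的示意实现;此处不做关系抽取,因此会触发 related_to 回退):
The original does not show the UI wiring. Below is a minimal sketch that plugs document_extractor_tool into Streamlit and renders with Agraph (reusing the build_graph_data sketch above; no relationship extraction is performed here, so the related_to fallback applies):
# 功能:Streamlit 界面接线示意(依赖上文的导入、document_extractor_tool 与 build_graph_data)
st.title("LangExtract 知识图谱 Demo")
doc_text = st.text_area("粘贴文档文本", height=200)
query = st.text_input("查询", "company founders and locations")

if st.button("抽取并可视化") and doc_text:
    data = document_extractor_tool(doc_text, query)
    entities = data["extracted_data"]
    nodes, edges = build_graph_data(entities, relationships=[])  # 无显式关系,触发 related_to 回退
    tab_graph, tab_entities = st.tabs(["图谱", "实体"])
    with tab_graph:
        config = Config(width=750, height=500, directed=True, physics=True, hierarchical=False)
        agraph(nodes=nodes, edges=edges, config=config)
    with tab_entities:
        st.json(entities)
    st.caption(f"实体 {len(nodes)} 个,边 {len(edges)} 条")  # 实时调试信息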