
PageIndex:开源RAG框架革新,LLM树搜索实现98.7%金融文档精准检索

2026/2/4
AI Summary (BLUF)

PageIndex is an open-source RAG framework that replaces traditional vector similarity matching with LLM-powered tree search, achieving 98.7% accuracy on financial benchmarks by mimicking human expert document navigation. (PageIndex是一个开源RAG框架,通过LLM驱动的树搜索替代传统向量相似性匹配,模拟人类专家文档导航方式,在金融基准测试中达到98.7%准确率。)

前言:传统RAG的困境

在使用大语言模型处理企业文档时,你是否遇到过这样的问题?

When using large language models to process corporate documents, have you ever encountered the following issues?

  • 向量检索经常“文不对题”,检索到的内容看起来相似但实际不相关

    Vector search often returns "irrelevant content that looks similar," where the retrieved information appears related but is actually not.

  • 长文档被切成碎片后,上下文信息丢失严重

    When long documents are split into chunks, severe context loss occurs.

  • 检索结果难以解释,不知道为什么返回了这些内容

    Search results are difficult to interpret, making it unclear why specific content was retrieved.

  • 面对专业文档(财报、法规、技术手册),检索准确率直线下降

    When dealing with professional documents (financial reports, regulations, technical manuals), retrieval accuracy plummets.

这些问题的根源在于:传统向量检索依赖的是“相似性”而非“相关性”。但在真实场景中,我们需要的是相关性,而相关性需要推理。

The root cause of these issues lies in the fact that traditional vector search relies on "similarity" rather than "relevance." In real-world scenarios, what we need is relevance, and relevance requires reasoning.

今天介绍的开源项目 PageIndex,正是为了解决这个痛点而生。

The open-source project PageIndex, introduced here, was created precisely to address this pain point.

PageIndex 核心理念:无向量、推理式RAG

PageIndex的设计灵感来源于AlphaGo——通过树搜索实现智能决策。它模拟人类专家阅读复杂文档的方式:

The design inspiration for PageIndex comes from AlphaGo—achieving intelligent decision-making through tree search. It mimics the way human experts read complex documents:

  1. 先看目录:了解文档整体结构

    First, look at the table of contents: Understand the overall document structure.

  2. 定位章节:根据问题推理应该查看哪个部分

    Locate the section: Reason about which part to examine based on the question.

  3. 逐层深入:在相关章节中继续细化查找

    Drill down layer by layer: Continue refining the search within the relevant section.

  4. 精准定位:最终找到最相关的内容

    Pinpoint accurately: Ultimately find the most relevant content.
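The four navigation steps above can be sketched as a recursive descent over the table-of-contents tree. This is an illustrative skeleton, not PageIndex's actual implementation: the `choose` callback stands in for the LLM's per-level reasoning (step 2), and `keyword_choose` is a toy stand-in that merely matches characters from the question.

```python
def navigate(node, question, choose, leaves=None):
    """Descend the table-of-contents tree: at each node, `choose`
    decides which children look relevant (step 2), then we drill
    into each of them (step 3) until we reach leaves (step 4)."""
    if leaves is None:
        leaves = []
    children = node.get("nodes", [])
    if not children:                          # no sub-sections: a leaf
        leaves.append(node)
        return leaves
    for child in choose(children, question):  # LLM decision point
        navigate(child, question, choose, leaves)
    return leaves

# Toy stand-in: keep children whose title shares any character with
# the question; the real system reasons over titles and summaries.
def keyword_choose(children, question):
    return [c for c in children if any(ch in c["title"] for ch in question)]

toc = {"title": "root", "nodes": [
    {"title": "账号管理", "nodes": [{"title": "账号申请流程", "nodes": []}]},
    {"title": "数据备份", "nodes": []},
]}
print([n["title"] for n in navigate(toc, "账号申请", keyword_choose)])
# ['账号申请流程']
```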

工作原理:从“智能目录”到“推理式树搜索”

PageIndex分两步完成检索:

PageIndex completes retrieval in two steps:

第一步:构建"智能目录"

将PDF文档自动转换为层级树状结构,类似“目录”但更智能。这个结构不仅包含标题,还包含节点ID、页码范围和LLM生成的章节摘要。

Automatically convert PDF documents into a hierarchical tree structure, similar to a "table of contents" but more intelligent. This structure includes not only titles but also node IDs, page ranges, and chapter summaries generated by an LLM.

{
  "title": "账号与访问管理",
  "node_id": "0009",
  "start_index": 6,
  "end_index": 6,
  "summary": "本章节规定了账号申请、使用和访问控制的管理要求...",
  "nodes": [
    {
      "title": "账号管理",
      "node_id": "0010",
      "nodes": [
        {"title": "账号申请流程", "node_id": "0011"},
        {"title": "新员工账号申请", "node_id": "0012"},
        {"title": "权限变更申请", "node_id": "0013"}
      ]
    }
  ]
}
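A minimal sketch of working with a tree of this shape. The field names (`title`, `node_id`, `nodes`) follow the example JSON above; real PageIndex output may carry additional fields such as `summary` and page indexes.

```python
def flatten(node, depth=0):
    """Yield (depth, node) for every node in the tree, depth-first."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from flatten(child, depth + 1)

def find_node(root, node_id):
    """Look up a node by its node_id, or return None."""
    for _, node in flatten(root):
        if node.get("node_id") == node_id:
            return node
    return None

# A trimmed copy of the example structure above.
tree = {
    "title": "账号与访问管理", "node_id": "0009",
    "start_index": 6, "end_index": 6,
    "nodes": [
        {"title": "账号管理", "node_id": "0010",
         "nodes": [{"title": "账号申请流程", "node_id": "0011"}]}
    ],
}

for depth, node in flatten(tree):
    print("  " * depth + f"[{node['node_id']}] {node['title']}")

print(find_node(tree, "0011")["title"])  # 账号申请流程
```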

第二步:推理式树搜索

当用户提问时,LLM不是简单地做向量匹配,而是像人类专家一样进行推理。例如,对于问题“如何申请账号?”,LLM会推理:“用户问的是账号申请,我应该先看‘账号与访问管理’这个章节,然后进入‘账号管理’子章节,最后定位到‘账号申请流程’和‘新员工账号申请’...”。

When a user asks a question, the LLM doesn't simply perform vector matching. Instead, it reasons like a human expert. For example, for the question "How to apply for an account?", the LLM reasons: "The user is asking about account application. I should first look at the 'Account and Access Management' chapter, then enter the 'Account Management' sub-chapter, and finally locate 'Account Application Process' and 'New Employee Account Application'..."
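The reasoning step can be approximated as: serialize the tree, ask the model for relevant node IDs plus its reasoning, and parse the JSON reply. The prompt wording and reply schema below are illustrative assumptions, not PageIndex's actual prompts, and the LLM call itself (e.g. via the OpenAI SDK) is omitted.

```python
import json

def build_search_prompt(tree: dict, question: str) -> str:
    """Assemble a prompt asking the LLM to pick relevant node_ids.
    Wording is illustrative; PageIndex's real prompts differ."""
    return (
        "You are given a document's table-of-contents tree as JSON.\n"
        "Select the node_ids most relevant to the question and reply\n"
        'with JSON like {"node_ids": [...], "reasoning": "..."}.\n\n'
        f"Tree:\n{json.dumps(tree, ensure_ascii=False)}\n\n"
        f"Question: {question}"
    )

def parse_reply(reply: str) -> list[str]:
    """Extract the node_id list from the model's JSON reply."""
    return json.loads(reply).get("node_ids", [])

# A hypothetical reply for "如何申请账号?":
reply = '{"node_ids": ["0011", "0012", "0010"], "reasoning": "账号申请相关章节"}'
print(parse_reply(reply))  # ['0011', '0012', '0010']
```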

实战案例:企业制度文档问答

让我们看一个真实的执行结果。

Let's look at a real execution result.

场景:对一份14页的《信息安全管理制度》文档进行问答。

Scenario: Conducting Q&A on a 14-page "Information Security Management System" document.

文档结构自动解析:PageIndex首先自动构建了文档的树状结构。

Automatic Document Structure Parsing: PageIndex first automatically constructs the tree structure of the document.

📄 文档: test.pdf
├── [0000] 目录 (页码 1-5)
├── [0001] 总则 (页码 5-5)
│   ├── [0002] 目的
│   ├── [0003] 适用范围
│   ├── [0004] 术语和定义
│   └── [0005] 职责与权限
├── [0009] 账号与访问管理 (页码 6-6)
│   ├── [0010] 账号管理
│   │   ├── [0011] 账号申请流程
│   │   │   ├── [0012] 新员工账号申请
│   │   │   └── [0013] 权限变更申请
│   │   └── [0014] 账号使用规范
│   └── [0017] 访问控制
...
共 63 个节点

问答执行
用户提问:如何申请账号?
推理过程:LLM分析认为问题最相关的是“账号申请流程”和“新员工账号申请”等节点。
检索到的节点

  • [0011] 账号申请流程 (页码 6-6)
  • [0012] 新员工账号申请 (页码 6-6)
  • [0010] 账号管理 (页码 6-6)

最终答案

Q&A Execution:
User Question: How to apply for an account?
Reasoning Process: The LLM analyzes that the question is most relevant to nodes such as "Account Application Process" and "New Employee Account Application."
Retrieved Nodes:

  • [0011] Account Application Process (Page 6-6)
  • [0012] New Employee Account Application (Page 6-6)
  • [0010] Account Management (Page 6-6)

Final Answer:

账号申请方式如下:

  • 新员工账号
    • 由人力资源部在入职当天通过OA系统提交账号申请单
    • 部门负责人在4小时内完成审批并给出审批意见
    • 信息技术部在1个工作日内创建账号并提供账号信息
  • 权限变更(如因职责调整、临时项目、岗位晋升/调动)
    • 满足上述条件时提出申请
    • 普通权限由部门负责人在1个工作日内审批

The account application process is as follows:

  • New employee accounts
    • HR submits an account application form via the OA system on the day of onboarding
    • The department head completes approval with comments within 4 hours
    • The IT department creates the account and provides the account details within 1 business day
  • Permission changes (e.g. role adjustments, temporary projects, promotion/transfer)
    • Submit an application when the above conditions are met
    • Ordinary permissions are approved by the department head within 1 business day
整个过程可解释、可追溯,清晰展示了检索路径和推理逻辑。

The entire process is explainable and traceable, clearly showing the retrieval path and reasoning logic.

PageIndex的核心优势与技术创新

1. 硬实力验证:98.7%的准确率

在金融领域权威基准测试 FinanceBench 上,基于PageIndex的Mafin 2.5系统达到了98.7% 的准确率,大幅领先传统向量RAG方案。这意味着在处理财报、SEC文件、法规文档等专业长文档时,PageIndex能提供远超传统方案的可靠性。

On the authoritative financial benchmark FinanceBench, the Mafin 2.5 system based on PageIndex achieved an accuracy rate of 98.7%, significantly outperforming traditional vector-based RAG solutions. This means that when processing professional long documents such as financial reports, SEC filings, and regulatory documents, PageIndex can provide far greater reliability than traditional approaches.

2. 与传统向量RAG的对比

特性 | 传统向量RAG | PageIndex
检索方式 | 向量相似度匹配 | LLM推理式树搜索
文档处理 | 暴力切片分块 | 保留自然章节结构
可解释性 | 黑箱,难以追溯 | 透明,显示推理路径
专业文档表现 | 容易"文不对题" | 精准定位相关内容

Feature | Traditional Vector RAG | PageIndex
Retrieval Method | Vector similarity matching | LLM reasoning-based tree search
Document Processing | Brute-force chunking | Preserves natural section structure
Explainability | Black box, difficult to trace | Transparent, shows reasoning path
Performance on Professional Docs | Prone to "irrelevant matches" | Accurately locates relevant content

3. 关键技术创新点

  • 无需向量数据库:告别Pinecone、Milvus等向量数据库的部署和维护成本。

    No vector database required: Eliminates the deployment and maintenance costs of vector databases like Pinecone and Milvus.

  • 无需分块调参:不再纠结chunk size是512还是1024。

    No chunking parameter tuning: No more struggling over whether the chunk size should be 512 or 1024.

  • 类人检索逻辑:像专家一样“先看目录,再定位章节”。

    Human-like retrieval logic: Mimics experts by "first looking at the table of contents, then locating the chapter."

  • 支持Vision RAG:可直接处理PDF页面图像,无需OCR。

    Supports Vision RAG: Can directly process PDF page images without the need for OCR.

客观分析:优缺点与适用场景

优点

  • 准确性高:基于推理而非相似度,对专业文档的理解更准确,在金融、法律、技术等领域表现尤为突出。

    High Accuracy: Based on reasoning rather than similarity, it understands professional documents more accurately, performing exceptionally well in fields like finance, law, and technology.

  • 可解释性强:完整展示推理过程和检索路径,便于人工验证和调试,支持精确到页码的引用溯源。

    Strong Explainability: Fully displays the reasoning process and retrieval path, facilitating manual verification and debugging, and supports citation tracing down to the page number.

  • 部署简单:无需向量数据库,仅依赖OpenAI API(可以改造),支持本地部署。

    Simple Deployment: No vector database needed; it relies only on the OpenAI API (which can be swapped for other providers) and supports local deployment.

  • 灵活性好:支持PDF和Markdown格式,可自定义模型和参数,提供API和MCP集成。

    Good Flexibility: Supports PDF and Markdown formats, allows for custom models and parameters, and provides API and MCP integration.

缺点

  • 成本考量:每次检索需要多次调用LLM进行推理,相比向量检索,Token消耗更高,对于简单查询可能“杀鸡用牛刀”。

    Cost Considerations: Each retrieval requires multiple LLM calls for reasoning, resulting in higher token consumption compared to vector search. It might be "overkill" for simple queries.

  • 延迟较高:树搜索需要多轮LLM调用,实时性要求极高的场景可能不适用。

    Higher Latency: Tree search requires multiple rounds of LLM calls, which may not be suitable for scenarios with extremely high real-time requirements.

  • 文档要求:对结构化程度较高的文档效果最佳,完全无结构的文档可能影响效果,超大规模文档(数百页以上)需要更多处理时间。

    Document Requirements: Works best with well-structured documents. Completely unstructured documents may affect performance. Very large documents (hundreds of pages or more) require more processing time.

  • 依赖LLM能力:检索质量受底层LLM能力影响,需要较强的推理能力支持。

    Dependence on LLM Capability: Retrieval quality is influenced by the underlying LLM's capabilities and requires strong reasoning support.
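A back-of-envelope illustration of why token cost and latency rise: each level of descent typically costs at least one reasoning call, whereas vector retrieval needs no LLM call until answer generation. All numbers here are illustrative assumptions, not measured PageIndex figures.

```python
def tree_search_calls(depth: int, branches_examined: int = 1) -> int:
    """Rough LLM call count: one reasoning call per level descended,
    multiplied if several branches are explored at each level."""
    return depth * branches_examined

vector_rag_calls = 1                   # only the final generation call
tree_depth = 3                         # e.g. chapter → section → subsection
tree_calls = tree_search_calls(tree_depth) + 1  # + final generation

print(f"vector RAG:  {vector_rag_calls} LLM call(s) per query")
print(f"tree search: {tree_calls} LLM call(s) per query")
```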

适用场景

最佳适用场景

Best Use Cases:

  1. 金融文档分析:年报、季报分析,SEC文件检索,财务数据提取。

    Financial Document Analysis: Annual/quarterly report analysis, SEC filing retrieval, financial data extraction.

  2. 法律法规查询:合规性检查,法规条款定位,合同条款分析。

    Legal and Regulatory Query: Compliance checks, locating regulatory clauses, contract clause analysis.

  3. 企业知识库:内部制度文档(如本文案例),操作手册查询,技术文档检索。

    Enterprise Knowledge Base: Internal policy documents (as in this case), operational manual queries, technical document retrieval.

  4. 学术研究:教材内容检索,论文文献分析,技术规范查阅。

    Academic Research: Textbook content retrieval, academic paper analysis, technical specification review.

不太适合的场景

Less Suitable Scenarios:

  • 实时聊天机器人:延迟敏感场景。

    Real-time chatbots: Latency-sensitive scenarios.

  • 海量短文本检索:向量检索更高效。

    Massive short-text retrieval: Vector search is more efficient.

  • 成本敏感项目:Token消耗较高。

    Cost-sensitive projects: Higher token consumption.

  • 非结构化内容:如社交媒体、聊天记录。

    Unstructured content: Such as social media posts, chat logs.

快速上手

  1. 安装

    Installation

    git clone https://github.com/VectifyAI/PageIndex.git
    cd PageIndex
    pip3 install -r requirements.txt
    
  2. 配置

    Configuration
    创建 .env 文件,并填入你的OpenAI API密钥:
    Create a .env file and enter your OpenAI API key:

    CHATGPT_API_KEY=your_openai_key_here
    
  3. 运行

    Run

    # 生成文档树结构 | Generate document tree structure
    python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
    
    # 本地检索验证 | Local retrieval verification
    python3 local_retrieval.py --query "你的问题"
    

总结

PageIndex代表了RAG技术的一个重要演进方向:从“暴力相似性匹配”走向“智能推理式检索”。对于需要处理专业长文档、追求高准确率、重视可解释性的场景,PageIndex是一个值得认真考虑的选择。虽然在成本和延迟方面有所取舍,但其在准确性和可解释性上的优势,足以弥补这些不足。如果你正在为RAG的检索准确率头疼,不妨试试这个“让AI像人类专家一样阅读”的新思路。

PageIndex represents an important direction in the evolution of RAG technology: moving from "brute-force similarity matching" to "intelligent reasoning-based retrieval." For scenarios that involve professional long documents, demand high accuracy, and value explainability, PageIndex is an option worth serious consideration. Although it trades off some cost and latency, its advantages in accuracy and explainability are enough to compensate. If you are struggling with RAG retrieval accuracy, it is worth trying this new approach of "letting AI read like a human expert."

