How to Search Long Documents Precisely: The PageIndex Reasoning-Based RAG Framework

Introduction

When you think of Retrieval-Augmented Generation (RAG), the first things that come to mind are usually large vector databases, piles of embeddings, and search driven by cosine similarity. In practice, this approach often struggles with the long, structured documents that financial analysts, legal teams, and academic researchers handle every day.

Core Idea: Why PageIndex?

PageIndex was created to answer one question: are vector databases truly indispensable for effective RAG? Its answer is no.

PageIndex takes a fundamentally different approach. Its core advantages include:

No Vectors – Instead of converting every page or chunk into embeddings, PageIndex builds a hierarchical tree that mirrors the document's natural sections (like a table of contents). Each node can optionally include a summary.
No Chunking – Traditional pipelines split documents into artificial chunks, which often severs context. PageIndex keeps sections whole and preserves narrative coherence.
Human-like Retrieval – By letting the LLM navigate the tree, PageIndex mimics how a domain expert reads and reasons. The model can traverse the hierarchy, ask clarifying questions, and backtrack, much like a human analyst.
Explainability & Traceability – Every answer can be traced back to specific tree nodes and page ranges, giving developers precise location references and a clear audit trail.
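To make the traceability point concrete, here is a minimal sketch of how an answer might carry its sources. The field names (`title`, `node_id`, `start_index`, `end_index`) and both helper functions are assumptions made for illustration, with the indices treated as page numbers; they are not taken from the PageIndex codebase.

```python
def cite(node: dict) -> str:
    """Format one retrieved node as a human-readable source reference."""
    start, end = node["start_index"], node["end_index"]
    pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
    return f"{node['title']} [{node['node_id']}], {pages}"


def answer_with_sources(answer: str, nodes: list) -> str:
    """Append an audit trail listing every node the answer relied on."""
    sources = "; ".join(cite(n) for n in nodes)
    return f"{answer}\nSources: {sources}"


# Hypothetical nodes, as a retrieval step might return them.
used_nodes = [
    {"title": "Risk Factors", "node_id": "0003", "start_index": 5, "end_index": 18},
    {"title": "Auditor's Note", "node_id": "0009", "start_index": 39, "end_index": 39},
]

print(answer_with_sources("Credit exposure rose year over year.", used_nodes))
```

Because each reference names a node and a page range, a reviewer can open the document at that location and check the claim directly.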

Core Concepts

Tree Index

The Tree Index is the core data structure of PageIndex. It is a JSON structure in which each node carries metadata: a title, start and end indices, a summary, and child nodes. The resulting hierarchy corresponds to the document's natural sections and serves as a "table of contents" customized for the LLM.
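As a rough illustration of that structure, the sketch below builds a small tree by hand and prints it as an outline. The key names (`title`, `start_index`, `end_index`, `summary`, `nodes`) are assumptions chosen to match the metadata listed above, the indices are assumed to be page numbers, and the sample document is invented; the actual output of PageIndex may differ.

```python
import json

# Hypothetical tree for an invented report; not real PageIndex output.
tree = {
    "title": "Annual Report",
    "start_index": 1,
    "end_index": 40,
    "summary": "Full-year results and outlook.",
    "nodes": [
        {
            "title": "Risk Factors",
            "start_index": 5,
            "end_index": 18,
            "summary": "Market, credit, and operational risks.",
            "nodes": [],
        },
        {
            "title": "Financial Statements",
            "start_index": 19,
            "end_index": 40,
            "summary": "Balance sheet, income statement, cash flows.",
            "nodes": [],
        },
    ],
}


def outline(node: dict, depth: int = 0) -> list:
    """Return one indented line per node, in document order."""
    pages = f"pages {node['start_index']}-{node['end_index']}"
    lines = [f"{'  ' * depth}{node['title']} ({pages})"]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(tree)))
# The tree is plain JSON-compatible data, so it serializes directly.
print(json.dumps(tree["nodes"][0], indent=2))
```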

LLM Reasoning

Instead of using nearest-neighbor lookup, the LLM reasons over the tree and performs a tree search. At each step it decides which branch to explore next, then selects the relevant sections to pass to the downstream process.
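The control flow of such a search can be sketched as follows. This is not PageIndex's implementation: the `choose` callback stands in for the LLM call, and the `keyword_chooser` used here is a deliberately naive word-overlap stub so the example runs without a model. The node keys (`title`, `summary`, `nodes`) and the sample tree are likewise assumptions, and the sketch omits backtracking.

```python
import re
from typing import Callable

# A chooser receives the query and a list of child nodes and returns
# the indices of the children worth exploring. In a real system this
# decision would come from an LLM prompted with titles and summaries.
Chooser = Callable[[str, list], list]


def tokens(text: str) -> set:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_chooser(query: str, children: list) -> list:
    """Stub for the LLM: keep children that share a word with the query."""
    wanted = tokens(query)
    return [
        i for i, child in enumerate(children)
        if wanted & tokens(child["title"] + " " + child.get("summary", ""))
    ]


def tree_search(node: dict, query: str, choose: Chooser) -> list:
    """Descend from node, following only chosen branches; return leaves."""
    children = node.get("nodes", [])
    if not children:
        return [node]
    hits = []
    for i in choose(query, children):
        hits.extend(tree_search(children[i], query, choose))
    return hits


# Hypothetical document tree.
doc_tree = {
    "title": "Annual Report",
    "nodes": [
        {
            "title": "Risk Factors",
            "summary": "Market and credit risk discussion",
            "nodes": [
                {"title": "Market Risk", "summary": "Interest rate and currency exposure", "nodes": []},
                {"title": "Credit Risk", "summary": "Counterparty default and loan losses", "nodes": []},
            ],
        },
        {"title": "Financial Statements", "summary": "Balance sheet and income statement", "nodes": []},
    ],
}

selected = tree_search(doc_tree, "credit default", keyword_chooser)
print([n["title"] for n in selected])
```

Keeping the chooser as a parameter means the traversal logic stays the same whether the decision comes from a stub, a prompt to a hosted model, or a human reviewer.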

Optional Configuration

Users can control the model, the maximum number of tokens per node, the tree depth, and whether node IDs or summaries are included.

Quick Install and Run

# 1. Clone the repo
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

# 2. Install dependencies
pip install -U -r requirements.txt

# 3. Set your OpenAI API key
# Create a `.env` in the repo root
# echo "CHATGPT_API_KEY=sk-…" > .env

# 4. Generate a tree for a PDF
python run_pageindex.py --pdf_path /path/to/your/document.pdf

Optional parameters let you fine-tune the process:

--model gpt-4o – select the model
--toc-check-pages 20 – number of pages to scan for a table of contents
--max-pages-per-node 10 – maximum number of pages a node can hold
--if-add-node-summary yes – add a brief summary to each node
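If you drive the script from Python, for example in a batch job, the command line can be assembled programmatically. The sketch below only builds and prints the command; it does not run it. The `build_command` helper is hypothetical, and it uses only the script name and flags listed above.

```python
import shlex


def build_command(pdf_path: str, **options) -> list:
    """Build the argument list for run_pageindex.py.

    Keyword names use underscores and are converted to the hyphenated
    flag spelling, e.g. toc_check_pages=20 -> --toc-check-pages 20.
    """
    cmd = ["python", "run_pageindex.py", "--pdf_path", pdf_path]
    for name, value in options.items():
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    return cmd


cmd = build_command(
    "/path/to/your/document.pdf",
    model="gpt-4o",
    toc_check_pages=20,
    if_add_node_summary="yes",
)
# Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```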

For Markdown files that follow heading conventions (#, ##, etc.), use --md_path.
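Heading levels map naturally onto a tree, which is why well-structured Markdown is easy to index. The sketch below shows one simple way to derive a hierarchy from `#` headings using a stack. It illustrates the general idea only and is not how PageIndex itself parses Markdown; the function name and node keys are assumptions, and headings inside fenced code blocks are not handled.

```python
import re

# ATX-style headings: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def markdown_to_tree(text: str) -> dict:
    """Nest headings by level under a synthetic root node."""
    root = {"title": "ROOT", "level": 0, "nodes": []}
    stack = [root]  # path from the root to the most recent heading
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if not match:
            continue
        node = {
            "title": match.group(2),
            "level": len(match.group(1)),
            "line": line_no,
            "nodes": [],
        }
        # Pop until the top of the stack is a strictly shallower heading.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root


sample = "# Guide\n\nIntro text.\n\n## Install\n\n### Linux\n\n## Usage\n"
guide = markdown_to_tree(sample)["nodes"][0]
print(guide["title"], [child["title"] for child in guide["nodes"]])
```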

Use Cases

Domain	Problem	How PageIndex Helps
Finance	SEC filings run 200+ pages with nested sections; embeddings often miss details.	PageIndex builds a tree so the LLM can pinpoint sections precisely; the PageIndex-based Mafin 2.5 system reports 98.7% accuracy on FinanceBench.
Legal	Case law and contracts require precise paragraph citations.	The tree structure preserves page ranges, enabling answers to include exact location references.
Academic	Research papers have multi-level subsections; topic-based retrieval is often unreliable.	Node summaries guide the LLM to relevant sections, improving citation accuracy.
Technical Manuals	Firmware documentation contains tables and diagrams.	PageIndex can index images via OCR-free visual RAG, providing direct page image context.

Benchmark Highlights

Mafin 2.5, a RAG system built on PageIndex for financial report analysis, achieved 98.7% accuracy on the FinanceBench benchmark, significantly outperforming several vector-based RAG systems. Combining a clean tree index with reasoning-based search avoids many of the shortcomings of similarity-only retrieval.

Integration Options

Self-Hosted – Run the Python repository locally, on a laptop or a server.
Chat Platform – VectifyAI offers a ChatGPT-style interface that you can try right away.
MCP/API – Expose the functionality with minimal code and integrate it into your own pipelines.

Future Directions

Multimodal Retrieval – Combine text and image nodes to support OCR-free retrieval over PDF page images.
Fine-Grained Summarization – Use more advanced summarization models to give nodes better descriptions.
Collaboration Features – Let multiple users annotate node paths and share retrieval logic.

Final Thoughts

PageIndex demonstrates that a well-structured index combined with LLM reasoning can replace traditional vector-based approaches in many real-world scenarios. For developers who want to build reliable, explainable RAG systems over long documents, the framework offers a compelling, low-code option that keeps the user in the loop and retrieves information much as a human expert would.