How Do You Achieve Human-like Document Retrieval? The PageIndex Intelligent Analysis Framework

📢 Latest Updates

🔥 Releases

PageIndex Chat: The first human-like document-analysis agent platform built for professional long documents. It can also be integrated via MCP or API (beta).

📝 Articles

PageIndex Framework: Introduces the PageIndex framework, an agentic, in-context tree index that enables LLMs to perform reasoning-based, human-like retrieval over long documents without a vector DB or chunking.

🧪 Cookbooks

- Vectorless RAG: A minimal, hands-on example of reasoning-based RAG using PageIndex. No vectors, no chunking, and human-like retrieval.
- Vision-based Vectorless RAG: OCR-free, vision-only RAG with PageIndex's reasoning-native retrieval workflow that works directly over PDF page images.

📑 Introduction to PageIndex

Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance: what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

Inspired by AlphaGo, we propose PageIndex, a vectorless, reasoning-based RAG system that builds a hierarchical tree index from long documents and uses LLMs to reason over that index for agentic, context-aware retrieval.

It simulates how human experts navigate and extract knowledge from complex documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. PageIndex performs retrieval in two steps:

1. Generate a “Table-of-Contents” tree structure index of the document.
2. Perform reasoning-based retrieval through tree search over that index. Because each step follows the tree, the retrieval is traceable and explainable, with page and section references.
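The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PageIndex's implementation: the `Node` class, `tree_search`, and `keyword_chooser` are hypothetical names, and the keyword-overlap chooser is a deliberately crude stand-in for the LLM so that the example runs offline. In a real system the `choose` callable would ask an LLM to reason about which section to open next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One entry of a hypothetical table-of-contents tree."""
    title: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def tree_search(root: Node, query: str,
                choose: Callable[[str, List[Node]], Optional[Node]]) -> List[Node]:
    """Descend from the root, asking `choose` which child to open next.

    Returns the path of visited nodes, which doubles as a retrieval trace.
    """
    path = [root]
    node = root
    while node.children:
        picked = choose(query, node.children)
        if picked is None:  # the chooser judges that no child is relevant
            break
        path.append(picked)
        node = picked
    return path


def keyword_chooser(query: str, candidates: List[Node]) -> Optional[Node]:
    """Stand-in for the LLM: pick the child whose title shares most words with the query."""
    query_words = set(query.lower().split())
    best, best_overlap = None, 0
    for cand in candidates:
        overlap = len(query_words & set(cand.title.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = cand, overlap
    return best


# Step 1 (built by hand here): a small table-of-contents tree.
report = Node("Annual Report", 1, 120, [
    Node("Business Overview", 1, 30),
    Node("Financial Statements", 31, 120, [
        Node("Balance Sheet", 31, 60),
        Node("Cash Flow Statement", 61, 120),
    ]),
])

# Step 2: walk the tree and keep the visited sections as a trace.
path = tree_search(report, "cash flow statement in the financial statements", keyword_chooser)
trace = " > ".join(f"{n.title} (pp. {n.start_page}-{n.end_page})" for n in path)
print(trace)
```

The returned path is what makes the result traceable: every answer can cite the chain of sections and page ranges that led to it.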

🎯 Core Features

Compared to traditional vector-based RAG, PageIndex features:

- No Vector DB: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
- No Chunking: Documents are organized into natural sections, not artificial chunks.
- Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents.
- Better Explainability and Traceability: Retrieval is based on reasoning, so it is traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).

PageIndex powers a reasoning-based RAG system that achieved state-of-the-art 98.7% accuracy on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis (see our blog post for details).

🛠️ Deployment Options

- Self-host: run locally with this open-source repo.
- Cloud Service: try instantly with our Chat Platform, or integrate with MCP or API.
- Enterprise: private or on-prem deployment. Contact us or book a demo for more details.

🧪 Quick Hands-on

- Try the Vectorless RAG notebook: a minimal, hands-on example of reasoning-based RAG using PageIndex.
- Experiment with Vision-based Vectorless RAG: no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.

🌲 PageIndex Tree Structure

PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It's ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

You can generate the PageIndex tree structure with this open-source repo, or use our API.
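To make the "table of contents" analogy concrete, here is a small sketch of how such a tree could be represented and inspected as nested dictionaries. The key names (`title`, `start_page`, `end_page`, `nodes`) and the sample document are assumptions made for illustration; check the output of your own run for the actual field names.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical tree: the keys below are illustrative assumptions,
# not the documented PageIndex output schema.
tree: Dict[str, Any] = {
    "title": "Annual Report", "start_page": 1, "end_page": 120,
    "nodes": [
        {"title": "Business Overview", "start_page": 1, "end_page": 30, "nodes": []},
        {"title": "Financial Statements", "start_page": 31, "end_page": 120, "nodes": [
            {"title": "Balance Sheet", "start_page": 31, "end_page": 60, "nodes": []},
            {"title": "Cash Flow Statement", "start_page": 61, "end_page": 120, "nodes": []},
        ]},
    ],
}


def walk(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in reading order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(node: Dict[str, Any]) -> List[str]:
    """Render the tree as an indented table of contents with page ranges."""
    return ["  " * depth + f"{n['title']} (pp. {n['start_page']}-{n['end_page']})"
            for depth, n in walk(node)]


def sections_for_page(node: Dict[str, Any], page: int) -> List[str]:
    """Titles of every section whose page range contains `page`, outermost first."""
    return [n["title"] for _, n in walk(node)
            if n["start_page"] <= page <= n["end_page"]]


print("\n".join(outline(tree)))
print(sections_for_page(tree, 45))
```

Because each node carries a page range, a whole section can be handed to the LLM as one unit instead of being cut into fixed-size chunks.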

⚙️ Usage Guide

You can follow these steps to generate a PageIndex tree from a PDF document.

1. Install dependencies

pip3 install --upgrade -r requirements.txt

2. Set your OpenAI API key

Create a .env file in the root directory and add your API key:

CHATGPT_API_KEY=your_openai_key_here
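Before starting a long run, it can help to confirm that the key is actually visible. The following is a standalone sanity check written with the standard library only; `parse_env` and `load_api_key` are hypothetical helpers, and how the repository itself reads the .env file is not described here.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_path: str = ".env") -> str:
    """Return CHATGPT_API_KEY from the environment, falling back to the .env file."""
    key = os.environ.get("CHATGPT_API_KEY")
    if not key and Path(env_path).is_file():
        key = parse_env(Path(env_path).read_text(encoding="utf-8")).get("CHATGPT_API_KEY")
    if not key:
        raise RuntimeError("CHATGPT_API_KEY is not set; add it to the .env file")
    return key
```

Calling `load_api_key()` from the repository root raises a clear error if the key is missing, which is cheaper than discovering the problem midway through processing a PDF.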

3. Run PageIndex on your PDF

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

Optional parameters

You can customize the processing with additional optional arguments:

--model                    OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages          Pages to check for table of contents (default: 20)
--max-pages-per-node       Max pages per node (default: 10)
--max-tokens-per-node      Max tokens per node (default: 20000)
--if-add-node-id           Add node ID (yes/no, default: yes)
--if-add-node-summary      Add node summary (yes/no, default: yes)
--if-add-doc-description   Add doc description (yes/no, default: yes)
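If you launch PageIndex from another script, the flags above can be assembled programmatically. The sketch below only builds the argument list from the flags listed in this section; `build_command` is a hypothetical helper, not part of the repository.

```python
import shlex
from typing import List

# Optional flags exactly as listed above (without the leading dashes).
KNOWN_FLAGS = {
    "model", "toc-check-pages", "max-pages-per-node", "max-tokens-per-node",
    "if-add-node-id", "if-add-node-summary", "if-add-doc-description",
}


def build_command(pdf_path: str, **options: object) -> List[str]:
    """Build the run_pageindex.py command line; keyword names use underscores."""
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for name, value in options.items():
        flag = name.replace("_", "-")
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"unknown option: --{flag}")
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_command("/path/to/your/document.pdf",
                    max_pages_per_node=5, if_add_node_summary="no")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True) from the repository root.
```

Rejecting unknown option names early catches typos before a long-running job starts.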

Markdown support

We also provide Markdown support for PageIndex. You can use the --md_path flag to generate a tree structure for a Markdown file.

python3 run_pageindex.py --md_path /path/to/your/document.md

Note: in this function, we use "#" to determine node headings and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your Markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our PageIndex OCR, which is designed to preserve the original hierarchy, to convert the PDF to a Markdown file, and then use this function.
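The heading-level rule in the note can be illustrated with a short sketch that turns "#" headings into a nested tree. This is not the repository's parser: `headings_to_tree` and its output keys are hypothetical, and for brevity it does not skip "#" lines that appear inside fenced code blocks.

```python
import re
from typing import Any, Dict, List

# One to six "#" characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def headings_to_tree(markdown: str) -> List[Dict[str, Any]]:
    """Nest headings by level: "##" becomes a child of the nearest preceding "#"."""
    roots: List[Dict[str, Any]] = []
    stack: List[Dict[str, Any]] = []  # currently open headings, deepest last
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        node = {"title": match.group(2), "level": len(match.group(1)), "nodes": []}
        # Close headings at the same or a deeper level before attaching.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots


sample = """# Report
Intro text.
## Overview
## Results
### Tables
# Appendix
"""
parsed = headings_to_tree(sample)
print([n["title"] for n in parsed])
print([n["title"] for n in parsed[0]["nodes"]])
```

A file whose heading levels were flattened or shuffled by a converter would produce a misleading tree here, which is why the note advises against using files converted by tools that lose the original hierarchy.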

📈 Case Study: PageIndex Leads Finance QA Benchmark

Mafin 2.5 is a reasoning-based RAG system for financial document analysis, powered by PageIndex. It achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.

Explore the full benchmark results and our blog post for detailed comparisons and performance metrics.

🧭 Resources

- 🧪 Cookbooks: hands-on, runnable examples and advanced use cases.
- 📖 Tutorials: practical guides and strategies, including Document Search and Tree Search.
- 📝 Blog: technical articles, research insights, and product updates.
- 🔌 MCP setup & API docs: integration details and configuration options.

⭐ Support Us

Leave us a star 🌟 if you like our project. Thank you!