The PageIndex Revolution: How a Reasoning-Based RAG Framework Surpasses Vector Search to Achieve 98.7% Accuracy

2026/1/28
AI Summary (BLUF)

PageIndex introduces a reasoning-based RAG framework that eliminates dependence on vector similarity search and document chunking. It organizes documents into hierarchical tree structures, enabling LLMs to navigate them like human experts through multi-step reasoning, and achieves 98.7% accuracy on the FinanceBench benchmark.

Introduction

While traditional RAG systems still rely on vector similarity search, a novel reasoning-based retrieval approach is quietly emerging.

Part 1: Breaking the Mold: The Dilemma of Traditional RAG and the Birth of PageIndex

Traditional RAG (Retrieval-Augmented Generation) systems based on vector databases face significant challenges when processing long documents, especially complex ones in specialized fields. The core issue—that similarity does not equal relevance—causes many seemingly advanced retrieval systems to underperform in practical applications.

When consulting financial reports, legal documents, or academic papers, what we truly need is not semantically similar text snippets, but precise information with high relevance. Determining this relevance often requires multi-step reasoning, which is precisely how human experts read documents.

Inspired by AlphaGo, the Vectify AI team introduced PageIndex—a revolutionary, reasoning-based RAG framework. It completely eliminates the need for vector databases and document chunking. Instead, by constructing a hierarchical tree-like index, it enables large language models to navigate document structures through reasoning, much like a human expert.

Core Innovation: How PageIndex Works

The innovation of PageIndex lies in its redefinition of the entire document retrieval process. It adopts a reasoning-based retrieval approach, completely breaking free from dependence on vector similarity.

1. The Five Major Limitations of Traditional Vector RAG

Before understanding PageIndex, let's first clarify the core problems faced by traditional RAG systems:

  1. Query-Knowledge Space Mismatch: Vector retrieval assumes that the most semantically similar text is also the most relevant. However, queries typically express intent, not content. In professional documents, many passages are semantically similar yet differ vastly in relevance.
  2. Semantic Similarity ≠ Relevance: Professional documents contain large amounts of content that is semantically similar but varies in importance; traditional methods struggle to isolate the truly relevant information.
  3. Hard Chunking Destroys Semantic Integrity: Fixed-size chunks (e.g., 512 or 1,000 tokens) often cut across sentences, paragraphs, or chapters, fragmenting meaning and context.
  4. Inability to Integrate Conversation History: Each query is processed independently; the retriever has no awareness of what was asked and answered earlier.
  5. Difficulty Handling Intra-Document References: Cross-references such as "see Appendix G" or "refer to Table 5.3" cannot be followed without additional preprocessing.
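To make limitation 3 concrete, here is a toy Python sketch of fixed-size chunking (character-based for simplicity; real systems typically chunk by tokens). The example text and chunk size are illustrative only:

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "Net revenue rose 12% in Q3. See Note 7 for adjustments."
chunks = chunk_fixed(doc, 20)
# The reference to Note 7 is split across chunk boundaries, severing it
# from its context -- the failure mode described in limitation 3.
print(chunks)
```

A retriever that scores these fragments independently can never see the full sentence that links the revenue figure to its adjustment note.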

2. Tree-Structured Index: Intelligent Document Organization

PageIndex converts long documents into a semantic tree structure, similar to a table of contents but optimized specifically for LLMs.

{
  "node_id": "0006",
  "title": "Financial Stability",
  "start_index": "21",
  "end_index": "22",
  "summary": "The Federal Reserve ...",
  "sub_nodes": [
    {
      "node_id": "0007",
      "title": "Monitoring Financial Vulnerabilities",
      "start_index": "22",
      "end_index": "28",
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "node_id": "0008",
      "title": "Domestic and International Cooperation and Coordination",
      "start_index": "28",
      "end_index": "31",
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}

Each node contains:

  • node_id: A unique node identifier.
  • title/name: A human-readable label or title.
  • description/summary: A detailed explanation of the node's content.
  • metadata: Arbitrary key-value pairs for context or attributes.
  • start_index/end_index: The node's positional (page) range within the document.
  • sub_nodes: An array of child nodes (the structure is recursive).
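A tree in this shape can be consumed with ordinary recursive traversal. A minimal sketch, assuming the field names from the JSON example above (the flatten helper is illustrative, not part of the PageIndex API):

```python
# Flatten a PageIndex-style tree into (node_id, title, start, end) tuples
# so nodes can be looked up by id or page range.

def flatten(node, out=None):
    if out is None:
        out = []
    out.append((node["node_id"], node["title"],
                int(node["start_index"]), int(node["end_index"])))
    for child in node.get("sub_nodes", []):
        flatten(child, out)
    return out

tree = {
    "node_id": "0006", "title": "Financial Stability",
    "start_index": "21", "end_index": "22",
    "sub_nodes": [
        {"node_id": "0007", "title": "Monitoring Financial Vulnerabilities",
         "start_index": "22", "end_index": "28"},
    ],
}
print(flatten(tree))
```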

3. The Reasoning Loop: Simulating Human Expert Thinking

PageIndex's retrieval process mimics the natural way humans navigate and extract information from long documents.

  1. Read the Table of Contents (ToC): Understand the document structure and identify potentially relevant chapters.
  2. Select a Chapter: Choose the chapter most likely to contain useful information for the question.
  3. Extract Relevant Information: Parse the selected chapter to gather content that may help answer the question.
  4. Is the Information Sufficient?
    • Yes → proceed to answer.
    • No → return to step 1 and repeat the loop with another chapter.
  5. Answer the Question: Once sufficient information is gathered, generate a complete, well-supported answer.

This dynamic, iterative reasoning process allows the system to proactively decide where to look based on the evolving context of the question.

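The five-step loop can be sketched in Python. The select_chapter, extract_info, and is_sufficient callables stand in for LLM calls and are hypothetical names, not PageIndex APIs; the toy stubs only demonstrate the control flow:

```python
# A minimal sketch of the five-step reasoning loop described above.

def reasoning_loop(question, toc, select_chapter, extract_info, is_sufficient,
                   max_steps=5):
    gathered = []        # information collected so far
    visited = set()      # chapters already examined
    for _ in range(max_steps):
        chapter = select_chapter(question, toc, visited)  # steps 1-2
        if chapter is None:                               # ToC exhausted
            break
        visited.add(chapter)
        gathered.append(extract_info(question, chapter))  # step 3
        if is_sufficient(question, gathered):             # step 4
            break
    return gathered                                       # input to step 5

# Toy stubs: walk chapters in order until one mentioning "results" is found.
toc = ["Introduction", "Results", "Appendix"]
ans = reasoning_loop(
    "What were the results?", toc,
    select_chapter=lambda q, t, v: next((c for c in t if c not in v), None),
    extract_info=lambda q, c: c.lower(),
    is_sufficient=lambda q, g: "results" in g,
)
print(ans)  # ['introduction', 'results']
```

Note how the loop stops as soon as the sufficiency check passes, so the "Appendix" chapter is never read: the system decides where to look next based on what it has already gathered.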
Part 2: Tree Search Methods: Technical Implementation of Intelligent Navigation

Reasoning-based retrieval requires robust tree search algorithm support. PageIndex offers multiple tree search methods to adapt to different application scenarios.

1. LLM Tree Search: Intelligent Navigation Based on Reasoning

Basic Strategy
PageIndex uses an LLM agent to perform the tree search, retrieving through reasoning rather than similarity matching. The basic prompt template is:

prompt = f"""
You are given a query and the tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Query: {query}
Document tree structure: {PageIndex_Tree}
Reply in the following JSON format:
{{
"thinking": <your reasoning about which nodes are relevant>,
"node_list": [node_id1, node_id2, ...]
}}
"""

Advanced Feature: Integrating Expert Knowledge
Unlike traditional vector-based RAG, PageIndex can integrate user preferences or expert knowledge simply by adding it to the LLM tree search prompt.

prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Query: {query}
Document tree structure: {PageIndex_Tree}
Expert Knowledge of relevant sections: {Preference}
Reply in the following JSON format:
{{
"thinking": <reasoning about which nodes are relevant>,
"node_list": [node_id1, node_id2, ...]
}}
"""

Example Expert Preference:

  • If the query mentions EBITDA adjustments, prioritize footnotes in Item 7 (MD&A) and Item 8 (Financial Statements) of a 10-K report.

2. Hybrid Tree Search: Balancing Speed and Accuracy

Background
LLM tree search has two main limitations:

  • Retrieval Speed: LLM-based tree search can be slow because each step requires LLM reasoning.
  • Summary-Based Node Selection: Relying solely on node summaries may miss important details in the original content.

Value-Based Tree Search
Inspired by AlphaGo, it uses an AI model to predict a "value," representing the likelihood that a given node contains information relevant to the query.

Steps to Build the Value Function:

  1. Chunking: Each node is divided into several smaller chunks.
  2. Vector Search: The query is used to retrieve the top-K most relevant chunks.
  3. Node Scoring: For each retrieved chunk, its parent node is identified; a node's relevance score is calculated by aggregating the similarity scores of its associated chunks.

Node Scoring Rule:
Formula: NodeScore = (1/√(N+1)) * Σ ChunkScore(n)

  • N is the number of content chunks associated with the node.
  • ChunkScore(n) is the relevance score of chunk n.
  • This scoring favors nodes with a few highly relevant chunks over nodes with many weakly relevant ones.
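The rule translates directly into code. A small sketch, including a check of the claim that a node with one strong chunk outscores a node with many weak ones:

```python
import math

def node_score(chunk_scores):
    """NodeScore = (1 / sqrt(N + 1)) * sum(ChunkScore(n)), where N is the
    number of chunks associated with the node."""
    n = len(chunk_scores)
    return sum(chunk_scores) / math.sqrt(n + 1)

# A node with one strong chunk outranks a node with many weak ones,
# even though the weak chunks' raw scores sum to more (1.8 > 0.9):
focused = node_score([0.9])       # 0.9 / sqrt(2)  ~= 0.636
diffuse = node_score([0.2] * 9)   # 1.8 / sqrt(10) ~= 0.569
print(focused > diffuse)  # True
```

The √(N+1) denominator is what dampens nodes that merely accumulate many marginal matches.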

Advantages of the Hybrid Method

  • Combines Speed and Depth: Pairs the speed of the value-based method with the depth of the LLM-based method.
  • Higher Recall: Achieves higher recall than either method alone by leveraging their complementary strengths.
  • Fast, Relevant Results: Delivers relevant results quickly without sacrificing accuracy or completeness.
  • Efficient Scaling: Scales efficiently to large document collections and complex queries.

3. Document Search Strategies for Different Scenarios

Metadata-Based Document Search

  • Applicable Scenarios: Documents that are easily distinguished by metadata, such as financial reports (categorized by company and period) or legal documents (categorized by case type).
  • Implementation Process:
    1. Upload all documents to PageIndex to obtain a doc_id for each.
    2. Set up an SQL table storing the documents along with their metadata and PageIndex doc_id.
    3. Use an LLM to convert the user's retrieval request into an SQL query that fetches the relevant documents.
    4. Use the retrieved documents' PageIndex doc_id values for further retrieval.
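A minimal sketch of this flow using SQLite. The table layout is illustrative, and a hard-coded SQL string stands in for the query an LLM would generate in step 3:

```python
import sqlite3

# Step 2: a table mapping metadata to PageIndex doc_ids (layout is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,   -- PageIndex doc_id from step 1
    company TEXT,
    period TEXT)""")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("doc_a", "Acme Corp", "2023-Q4"),
     ("doc_b", "Acme Corp", "2024-Q1"),
     ("doc_c", "Globex", "2023-Q4")])

# Step 3: the SQL an LLM might produce for "Acme's 2023 filings".
sql = "SELECT doc_id FROM documents WHERE company = 'Acme Corp' AND period LIKE '2023%'"
doc_ids = [row[0] for row in conn.execute(sql)]
print(doc_ids)  # these doc_ids feed PageIndex tree search in step 4
```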

Semantic-Based Document Search

  • Applicable Scenarios: Collections covering diverse topics, where semantic understanding is required to identify the relevant document set.
  • Document Scoring Formula: DocScore = (1/√(N+1)) * Σ ChunkScore(n), aggregated over a document's retrieved chunks, analogous to node scoring.

Description-Based Document Search

  • Applicable Scenarios: A lightweight strategy for small document collections.
  • Method: Identify relevant documents through simple description matching or keyword search.

Part 3: Advantage Comparison: Why Similarity ≠ Relevance?

Vector RAG vs. Reasoning RAG

  • Query-Knowledge Mismatch: Vector RAG matches surface similarity and often misses the true context; Reasoning RAG uses inference to identify the most relevant parts of the document.
  • Similarity ≠ Relevance: Vector RAG retrieves semantically similar but irrelevant chunks; Reasoning RAG retrieves contextually relevant information.
  • Hard Chunking: Vector RAG's fixed-length chunks disrupt meaning; Reasoning RAG dynamically retrieves coherent sections.
  • No Chat Context: Vector RAG treats each query in isolation; Reasoning RAG's multi-turn reasoning considers prior context.
  • Cross-References: Vector RAG cannot follow internal document links; Reasoning RAG follows intra-document references via ToC reasoning.

Core Advantages of PageIndex

  1. No Vector Database Required: Completely eliminates dependence on vector similarity search.
  2. No Document Chunking: Documents are organized by their natural sections, preserving integrity.
  3. Human-Like Retrieval: Simulates the document-navigation style of human experts.
  4. Better Explainability: Clear, traceable retrieval paths provide page and chapter citations.
  5. Precise Relevance: Matches information through reasoning rather than similarity.
  6. Multi-Turn Dialogue Support: Retrieval is context-aware, enabling coherent multi-turn exploration.
  7. Intelligent Reference Handling: Follows intra-document references like a human reader by leveraging ToC reasoning.

Part 4: Practical Validation: 98.7% Accuracy on FinanceBench

Data is the most convincing evidence. The Mafin 2.5 financial document analysis system, powered by PageIndex, achieved a 98.7% accuracy rate on the authoritative FinanceBench benchmark, significantly surpassing traditional vector-based RAG systems.

This achievement demonstrates the immense potential of reasoning-based retrieval methods in the field of professional document analysis. PageIndex has shown exceptional performance, particularly when handling complex financial reports such as SEC filings and earnings disclosures.

Part 5: Quick Start: Running PageIndex in Four Steps

Developers can quickly get started with PageIndex through simple steps.

# 1. Clone the GitHub repository
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
# 2. Install dependencies
pip3 install --upgrade -r requirements.txt
# 3. Set OpenAI API key
echo "CHATGPT_API_KEY=your_openai_key_here" > .env
# 4. Run PageIndex
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

From here, you can begin using PageIndex for intelligent document retrieval.

Part 6: Application Scenarios: Which Documents Are Best Suited for PageIndex?

PageIndex is particularly well-suited for processing the following types of documents:

  • Financial Reports: Annual reports, quarterly reports, SEC filings.
  • Regulatory Documents: Regulations, compliance documents.
  • Academic Papers: Long research papers, literature reviews.
  • Legal Documents: Contracts, court judgments.
  • Technical Manuals: API documentation, technical specifications.
  • Textbooks: Professional textbooks, reference books.

Any long document that exceeds the LLM's context limit and requires domain-specific knowledge and multi-step reasoning is an ideal application scenario for PageIndex.

Part 7: Technical Outlook: The Future of Reasoning-Based RAG

PageIndex represents not just a tool, but a new paradigm for document retrieval. As the reasoning capabilities of large language models continue to improve, reasoning-based retrieval will gain even greater development space.

  1. Multimodal Support: Vision RAG already enables vision-based document analysis.
  2. Agent Integration: Deep integration with AI agents through the MCP protocol.
  3. Domain Specialization: Customized optimization for vertical domains.

Conclusion: Reshaping the New Standard for Document Retrieval

PageIndex addresses the core pain points in the RAG field with a simple yet elegant approach. It teaches us that true intelligent retrieval should not stop at surface similarity but should delve into the level of relevance reasoning.

When building the next generation of AI document analysis systems, PageIndex provides a successful paradigm worth emulating: simulating human thinking patterns, allowing AI to truly understand rather than merely match.

For developers and enterprises seeking more accurate and explainable document retrieval solutions, PageIndex is undoubtedly a new option worthy of in-depth exploration.
