The PageIndex Revolution: How a Reasoning-Based RAG Framework Surpasses Vector Search to Achieve 98.7% Accuracy

2026/1/28
AI Summary (BLUF)

PageIndex introduces a reasoning-based RAG framework that eliminates dependence on vector similarity search and document chunking. It organizes documents into hierarchical tree structures, enabling LLMs to navigate them like human experts through multi-step reasoning, and achieves 98.7% accuracy on the FinanceBench benchmark.

Introduction

While traditional RAG systems still rely on vector similarity search, a novel reasoning-based retrieval approach is quietly emerging.

Part 1: Breaking the Mold: The Dilemma of Traditional RAG and the Birth of PageIndex

Traditional RAG (Retrieval-Augmented Generation) systems based on vector databases face significant challenges when processing long documents, especially complex ones in specialized fields. The core issue—that similarity does not equal relevance—causes many seemingly advanced retrieval systems to underperform in practical applications.

When consulting financial reports, legal documents, or academic papers, what we truly need is not semantically similar text snippets, but precise information with high relevance. Determining this relevance often requires multi-step reasoning, which is precisely how human experts read documents.

Inspired by AlphaGo, the Vectify AI team introduced PageIndex—a revolutionary, reasoning-based RAG framework. It completely eliminates the need for vector databases and document chunking. Instead, by constructing a hierarchical tree-like index, it enables large language models to navigate document structures through reasoning, much like a human expert.

Core Innovation: How PageIndex Works

The innovation of PageIndex lies in its redefinition of the entire document retrieval process. It adopts a reasoning-based retrieval approach, completely breaking free from dependence on vector similarity.

1. The Five Major Limitations of Traditional Vector RAG

Before understanding PageIndex, let's first clarify the core problems faced by traditional RAG systems:

  1. Query-Knowledge Space Mismatch: Vector retrieval assumes that the most semantically similar text is also the most relevant. However, queries typically express intent, not content. In professional documents, many passages are semantically similar yet differ vastly in relevance.
  2. Semantic Similarity ≠ Relevance: Professional documents contain large amounts of content that is semantically similar but varies in importance; traditional methods struggle to isolate the truly relevant information.
  3. Hard Chunking Destroys Semantic Integrity: Fixed-size chunks (e.g., 512 or 1,000 tokens) often cut across sentences, paragraphs, or chapters, fragmenting meaning and context.
  4. Inability to Integrate Conversation History: Each query is processed independently; the retriever has no awareness of what was asked and answered earlier.
  5. Difficulty Handling Intra-Document References: Cross-references such as "see Appendix G" or "refer to Table 5.3" cannot be followed without additional preprocessing.
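To make limitation 3 concrete, here is a toy Python sketch of fixed-size chunking (character-based for simplicity; real systems typically chunk by tokens). The example text and chunk size are illustrative only:

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "Net revenue rose 12% in Q3. See Note 7 for adjustments."
chunks = chunk_fixed(doc, 20)
# The reference to Note 7 is split across chunk boundaries, severing it
# from its context -- the failure mode described in limitation 3.
print(chunks)
```

A retriever that scores these fragments independently can never see the full sentence that links the revenue figure to its adjustment note.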

2. Tree-Structured Index: Intelligent Document Organization

PageIndex converts long documents into a semantic tree structure, similar to a table of contents but optimized specifically for LLMs.

{
  "node_id": "0006",
  "title": "Financial Stability",
  "start_index": "21",
  "end_index": "22",
  "summary": "The Federal Reserve ...",
  "sub_nodes": [
    {
      "node_id": "0007",
      "title": "Monitoring Financial Vulnerabilities",
      "start_index": "22",
      "end_index": "28",
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "node_id": "0008",
      "title": "Domestic and International Cooperation and Coordination",
      "start_index": "28",
      "end_index": "31",
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}

Each node contains:

  • node_id: A unique node identifier.
  • title/name: A human-readable label or title.
  • description/summary: A detailed explanation of the node's content.
  • metadata: Arbitrary key-value pairs for context or attributes.
  • start_index/end_index: The node's positional (page) range within the document.
  • sub_nodes: An array of child nodes (the structure is recursive).
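A tree in this shape can be consumed with ordinary recursive traversal. A minimal sketch, assuming the field names from the JSON example above (the flatten helper is illustrative, not part of the PageIndex API):

```python
# Flatten a PageIndex-style tree into (node_id, title, start, end) tuples
# so nodes can be looked up by id or page range.

def flatten(node, out=None):
    if out is None:
        out = []
    out.append((node["node_id"], node["title"],
                int(node["start_index"]), int(node["end_index"])))
    for child in node.get("sub_nodes", []):
        flatten(child, out)
    return out

tree = {
    "node_id": "0006", "title": "Financial Stability",
    "start_index": "21", "end_index": "22",
    "sub_nodes": [
        {"node_id": "0007", "title": "Monitoring Financial Vulnerabilities",
         "start_index": "22", "end_index": "28"},
    ],
}
print(flatten(tree))
```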

3. The Reasoning Loop: Simulating Human Expert Thinking

PageIndex's retrieval process mimics the natural way humans navigate and extract information from long documents.

  1. Read the Table of Contents (ToC): Understand the document structure and identify potentially relevant chapters.
  2. Select a Chapter: Choose the chapter most likely to contain useful information for the question.
  3. Extract Relevant Information: Parse the selected chapter to gather content that may help answer the question.
  4. Is the Information Sufficient?
    • Yes → proceed to answer.
    • No → return to step 1 and repeat the loop with another chapter.
  5. Answer the Question: Once sufficient information is gathered, generate a complete, well-supported answer.

This dynamic, iterative reasoning process allows the system to proactively decide where to look based on the evolving context of the question.

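The five-step loop can be sketched in Python. The select_chapter, extract_info, and is_sufficient callables stand in for LLM calls and are hypothetical names, not PageIndex APIs; the toy stubs only demonstrate the control flow:

```python
# A minimal sketch of the five-step reasoning loop described above.

def reasoning_loop(question, toc, select_chapter, extract_info, is_sufficient,
                   max_steps=5):
    gathered = []        # information collected so far
    visited = set()      # chapters already examined
    for _ in range(max_steps):
        chapter = select_chapter(question, toc, visited)  # steps 1-2
        if chapter is None:                               # ToC exhausted
            break
        visited.add(chapter)
        gathered.append(extract_info(question, chapter))  # step 3
        if is_sufficient(question, gathered):             # step 4
            break
    return gathered                                       # input to step 5

# Toy stubs: walk chapters in order until one mentioning "results" is found.
toc = ["Introduction", "Results", "Appendix"]
ans = reasoning_loop(
    "What were the results?", toc,
    select_chapter=lambda q, t, v: next((c for c in t if c not in v), None),
    extract_info=lambda q, c: c.lower(),
    is_sufficient=lambda q, g: "results" in g,
)
print(ans)  # ['introduction', 'results']
```

Note how the loop stops as soon as the sufficiency check passes, so the "Appendix" chapter is never read: the system decides where to look next based on what it has already gathered.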
Part 2: Tree Search Methods: Technical Implementation of Intelligent Navigation

Reasoning-based retrieval requires robust tree search algorithm support. PageIndex offers multiple tree search methods to adapt to different application scenarios.

1. LLM Tree Search: Intelligent Navigation Based on Reasoning

Basic Strategy
PageIndex uses an LLM agent to perform the tree search, retrieving through reasoning rather than similarity matching. The basic prompt template is:

prompt = f"""
You are given a query and the tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Query: {query}
Document tree structure: {PageIndex_Tree}
Reply in the following JSON format:
{{
"thinking": <your reasoning about which nodes are relevant>,
"node_list": [node_id1, node_id2, ...]
}}
"""

Advanced Feature: Integrating Expert Knowledge
Unlike traditional vector-based RAG, PageIndex can integrate user preferences or expert knowledge simply by adding it to the LLM tree search prompt.

prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Query: {query}
Document tree structure: {PageIndex_Tree}
Expert Knowledge of relevant sections: {Preference}
Reply in the following JSON format:
{{
"thinking": <reasoning about which nodes are relevant>,
"node_list": [node_id1, node_id2, ...]
}}
"""

Example Expert Preference:

  • If the query mentions EBITDA adjustments, prioritize footnotes in Item 7 (MD&A) and Item 8 (Financial Statements) of a 10-K report.

2. Hybrid Tree Search: Balancing Speed and Accuracy

Background
LLM tree search has two main limitations:

  • Retrieval Speed: LLM-based tree search can be slow because each step requires LLM reasoning.
  • Summary-Based Node Selection: Relying solely on node summaries may miss important details in the original content.

Value-Based Tree Search
Inspired by AlphaGo, it uses an AI model to predict a "value," representing the likelihood that a given node contains information relevant to the query.

Steps to Build the Value Function:

  1. Chunking: Each node is divided into several smaller chunks.
  2. Vector Search: The query is used to retrieve the top-K most relevant chunks.
  3. Node Scoring: For each retrieved chunk, its parent node is identified; a node's relevance score is calculated by aggregating the similarity scores of its associated chunks.

Node Scoring Rule:
Formula: NodeScore = (1/√(N+1)) * Σ ChunkScore(n)

  • N is the number of content chunks associated with the node.
  • ChunkScore(n) is the relevance score of chunk n.
  • This scoring favors nodes with a few highly relevant chunks over nodes with many weakly relevant ones.
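The rule translates directly into code. A small sketch, including a check of the claim that a node with one strong chunk outscores a node with many weak ones:

```python
import math

def node_score(chunk_scores):
    """NodeScore = (1 / sqrt(N + 1)) * sum(ChunkScore(n)), where N is the
    number of chunks associated with the node."""
    n = len(chunk_scores)
    return sum(chunk_scores) / math.sqrt(n + 1)

# A node with one strong chunk outranks a node with many weak ones,
# even though the weak chunks' raw scores sum to more (1.8 > 0.9):
focused = node_score([0.9])       # 0.9 / sqrt(2)  ~= 0.636
diffuse = node_score([0.2] * 9)   # 1.8 / sqrt(10) ~= 0.569
print(focused > diffuse)  # True
```

The √(N+1) denominator is what dampens nodes that merely accumulate many marginal matches.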

Advantages of the Hybrid Method

  • Combines Speed and Depth: Pairs the speed of the value-based method with the depth of the LLM-based method.
  • Higher Recall: Achieves higher recall than either method alone by leveraging their complementary strengths.
  • Fast, Relevant Results: Delivers relevant results quickly without sacrificing accuracy or completeness.
  • Efficient Scaling: Scales efficiently to large document collections and complex queries.

3. Document Search Strategies for Different Scenarios

Metadata-Based Document Search

  • Applicable Scenarios: Documents that are easily distinguished by metadata, such as financial reports (categorized by company and period) or legal documents (categorized by case type).
  • Implementation Process:
    1. Upload all documents to PageIndex to obtain a doc_id for each.
    2. Set up an SQL table storing the documents along with their metadata and PageIndex doc_id.
    3. Use an LLM to convert the user's retrieval request into an SQL query that fetches the relevant documents.
    4. Use the retrieved documents' PageIndex doc_id values for further retrieval.
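A minimal sketch of this flow using SQLite. The table layout is illustrative, and a hard-coded SQL string stands in for the query an LLM would generate in step 3:

```python
import sqlite3

# Step 2: a table mapping metadata to PageIndex doc_ids (layout is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,   -- PageIndex doc_id from step 1
    company TEXT,
    period TEXT)""")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("doc_a", "Acme Corp", "2023-Q4"),
     ("doc_b", "Acme Corp", "2024-Q1"),
     ("doc_c", "Globex", "2023-Q4")])

# Step 3: the SQL an LLM might produce for "Acme's 2023 filings".
sql = "SELECT doc_id FROM documents WHERE company = 'Acme Corp' AND period LIKE '2023%'"
doc_ids = [row[0] for row in conn.execute(sql)]
print(doc_ids)  # these doc_ids feed PageIndex tree search in step 4
```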

Semantic-Based Document Search

  • Applicable Scenarios: Collections covering diverse topics, where semantic understanding is required to identify the relevant document set.
  • Document Scoring Formula: DocScore = (1/√(N+1)) * Σ ChunkScore(n), aggregated over a document's retrieved chunks, analogous to node scoring.

Description-Based Document Search

  • Applicable Scenarios: A lightweight strategy for small document collections.
  • Method: Identify relevant documents through simple description matching or keyword search.

Part 3: Advantage Comparison: Why Similarity ≠ Relevance?

Vector RAG vs. Reasoning RAG

  • Query-Knowledge Mismatch: Vector RAG matches surface similarity and often misses the true context; Reasoning RAG uses inference to identify the most relevant parts of the document.
  • Similarity ≠ Relevance: Vector RAG retrieves semantically similar but irrelevant chunks; Reasoning RAG retrieves contextually relevant information.
  • Hard Chunking: Vector RAG's fixed-length chunks disrupt meaning; Reasoning RAG dynamically retrieves coherent sections.
  • No Chat Context: Vector RAG treats each query in isolation; Reasoning RAG's multi-turn reasoning considers prior context.
  • Cross-References: Vector RAG cannot follow internal document links; Reasoning RAG follows intra-document references via ToC reasoning.

Core Advantages of PageIndex

  1. No Vector Database Required: Completely eliminates dependence on vector similarity search.
  2. No Document Chunking: Documents are organized by their natural sections, preserving integrity.
  3. Human-Like Retrieval: Simulates the document-navigation style of human experts.
  4. Better Explainability: Clear, traceable retrieval paths provide page and chapter citations.
  5. Precise Relevance: Matches information through reasoning rather than similarity.
  6. Multi-Turn Dialogue Support: Retrieval is context-aware, enabling coherent multi-turn exploration.
  7. Intelligent Reference Handling: Follows intra-document references like a human reader by leveraging ToC reasoning.

Part 4: Practical Validation: 98.7% Accuracy on FinanceBench

Data is the most convincing evidence. The Mafin 2.5 financial document analysis system, powered by PageIndex, achieved a 98.7% accuracy rate on the authoritative FinanceBench benchmark, significantly surpassing traditional vector-based RAG systems.

This achievement demonstrates the immense potential of reasoning-based retrieval methods in the field of professional document analysis. PageIndex has shown exceptional performance, particularly when handling complex financial reports such as SEC filings and earnings disclosures.

Part 5: Quick Start: Running PageIndex in Four Steps

Developers can quickly get started with PageIndex through simple steps.

# 1. Clone the GitHub repository
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
# 2. Install dependencies
pip3 install --upgrade -r requirements.txt
# 3. Set OpenAI API key
echo "CHATGPT_API_KEY=your_openai_key_here" > .env
# 4. Run PageIndex
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

From here, you can begin using PageIndex for intelligent document retrieval.

Part 6: Application Scenarios: Which Documents Are Best Suited for PageIndex?

PageIndex is particularly well-suited for processing the following types of documents:

  • Financial Reports: Annual reports, quarterly reports, SEC filings.
  • Regulatory Documents: Regulations, compliance documents.
  • Academic Papers: Long research papers, literature reviews.
  • Legal Documents: Contracts, court judgments.
  • Technical Manuals: API documentation, technical specifications.
  • Textbooks: Professional textbooks, reference books.

Any long document that exceeds the LLM's context limit and requires domain-specific knowledge and multi-step reasoning is an ideal application scenario for PageIndex.

Part 7: Technical Outlook: The Future of Reasoning-Based RAG

PageIndex represents not just a tool, but a new paradigm for document retrieval. As the reasoning capabilities of large language models continue to improve, reasoning-based retrieval will gain even greater development space.

  1. Multimodal Support: Vision RAG already enables vision-based document analysis.
  2. Agent Integration: Deep integration with AI agents through the MCP protocol.
  3. Domain Specialization: Customized optimization for vertical domains.

Conclusion: Reshaping the New Standard for Document Retrieval

PageIndex addresses the core pain points in the RAG field with a simple yet elegant approach. It teaches us that true intelligent retrieval should not stop at surface similarity but should delve into the level of relevance reasoning.

When building the next generation of AI document analysis systems, PageIndex provides a successful paradigm worth emulating: simulating human thinking patterns, allowing AI to truly understand rather than merely match.

For developers and enterprises seeking more accurate and explainable document retrieval solutions, PageIndex is undoubtedly a new option worthy of in-depth exploration.
