
PageIndex: A Reasoning-Based Next-Generation RAG Framework with 98.7% Accuracy

2026/1/27
AI Summary (BLUF)

PageIndex is an open-source, reasoning-based RAG framework that replaces vector similarity search with a structured document tree and LLM reasoning. It preserves document context and exposes a transparent retrieval path, and a model built on it reportedly reached 98.7% accuracy on FinanceBench.

Introduction

Over the past two years, RAG (Retrieval-Augmented Generation) has become close to standard in AI applications. Whether the use case is customer service, an enterprise knowledge base, financial analysis, or legal document Q&A, the recipe is usually the same: split documents into chunks, embed them as vectors, match by cosine similarity, and pass the retrieved chunks to a large language model to answer.

This approach is simple and effective, but its weakness is just as clear. When a question is complex, spans several pages, or involves multi-step logic, vector similarity retrieval often heads in the wrong direction.

For example, suppose you ask, "What was the year-over-year change in the company's operating cash flow for 2023?" A traditional RAG system may retrieve a pile of paragraphs that mention "cash flow" while missing the context that matters: operating versus investing activities, and 2023 versus 2022. The result is high similarity but poor relevance.
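The failure mode can be illustrated with a toy sketch. This is not how a production RAG system embeds text: it uses plain bag-of-words counts in place of a learned embedding, and both passages are invented for the example. Under those assumptions, the passage about investing activities outscores the one that actually contains the operating cash flow figures for both years.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word occurrences (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "operating cash flow 2023 change"

# Hypothetical chunks, invented for this example.
chunks = {
    "investing": "Cash flow from investing activities: cash flow change in 2023 "
                 "was driven by capital expenditure.",
    "operating": "Net cash provided by operating activities was 410 million in 2023 "
                 "and 350 million in 2022.",
}

q = bag_of_words(query)
scores = {name: cosine(q, bag_of_words(text)) for name, text in chunks.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("top match:", best)
```

Real embedding models capture far more meaning than word counts, so this exaggerates the problem, but the underlying risk is the same: the score rewards surface overlap with the query, not whether the passage can answer it.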

So is there a retrieval method that reads documents more like a human expert would? The recently open-sourced PageIndex offers a different approach. It does not rely on a vector database. Instead, it combines the document's logical structure with LLM reasoning to make retrieval more accurate and more transparent.

PageIndex Overview

PageIndex is a "reasoning-based RAG" framework open-sourced by VectifyAI. Its core idea is that a document is not a pile of unordered paragraphs but a tree with a hierarchical structure. Rather than chunking and embedding, it first extracts a table-of-contents tree that preserves the document's original logic. When a user asks a question, the LLM performs "reasoning-based retrieval" along that tree, narrowing the scope step by step until it reaches the relevant node.

The retrieval process therefore resembles how a human expert consults a report: check the table of contents to find the relevant chapter, then read the key paragraphs closely, rather than searching blindly for similar words across a vast body of text.

Several points from PageIndex's official introduction and published test results stand out:

  1. No chunking, no lost context: Traditional RAG has to split long documents into chunks before loading them into a vector database, and chunking breaks context. For example, the explanatory text before and after a table can easily be separated from it, so retrieval returns answers that miss the point. PageIndex keeps the complete structure without chunking, so context stays continuous.

  2. Tree structure, transparent and traceable: PageIndex outputs a JSON table-of-contents tree in which each node contains a title, page number, summary, child nodes, and so on. When a user asks a question, the retrieval path is fully visible: you can see how the system moved from "Financial Statements" → "Cash Flow Statement" → "Operating Cash Flow" step by step. This matters especially in enterprise applications, where an answer must be not only correct but also explainable.

  3. Reasoning instead of similarity matching: In PageIndex, retrieval is not a "Top-K similarity search" but a "reasoning-based tree search." In other words, it asks which section is most likely to answer the question rather than simply comparing word similarity. This helps it perform better on cross-page, multi-condition questions.

  4. Reported results well ahead of traditional approaches: On FinanceBench, a widely cited benchmark for financial documents, the PageIndex-powered model (Mafin 2.5) achieved 98.7% accuracy, far above mainstream RAG systems built on vector databases. This suggests that in professional scenarios (financial reports, legal contracts, technical manuals) it can approach expert-level performance.
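To make the tree idea in point 2 concrete, here is a minimal sketch of what such a node might look like and how a retrieval path can be read off the tree. The class name, field names, section titles, and page numbers are all illustrative assumptions based on the description above (title, page number, summary, child nodes). They are not PageIndex's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    """One section of a document. Field names are illustrative assumptions."""
    title: str
    page: int
    summary: str = ""
    children: List["TreeNode"] = field(default_factory=list)


def find_path(node: TreeNode, target_title: str) -> Optional[List[str]]:
    """Depth-first search returning the titles from this node down to the target, or None."""
    if node.title == target_title:
        return [node.title]
    for child in node.children:
        sub_path = find_path(child, target_title)
        if sub_path is not None:
            return [node.title] + sub_path
    return None


# Hypothetical document tree; titles and page numbers are made up.
report = TreeNode("Annual Report", 1, "Full-year report", [
    TreeNode("Financial Statements", 40, "Audited statements", [
        TreeNode("Balance Sheet", 42, "Assets and liabilities"),
        TreeNode("Cash Flow Statement", 45, "Cash movements", [
            TreeNode("Operating Cash Flow", 45, "Cash from operations"),
            TreeNode("Investing Cash Flow", 46, "Capital expenditure"),
        ]),
    ]),
])

path = find_path(report, "Operating Cash Flow")
print(" -> ".join(path))
```

Because every node keeps its parent chain and page number, the route to an answer can be shown to a reviewer, which is harder to do with a flat list of similarity-ranked chunks.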

How PageIndex Works

To make this more concrete, here is the PageIndex workflow:

  1. OCR / document parsing: PageIndex's own OCR model (which supports long context) converts PDFs or scanned documents into structured text, preserving hierarchical headings and page numbers.

  2. Generate the table-of-contents tree (PageIndex Tree): The document is converted into a tree in which each node contains a title, a summary, and child nodes. In effect, this turns the document into a "knowledge map."

  3. User query → tree search: When a question arrives, PageIndex has the LLM reason from the root of the tree and filter nodes level by level until it finds the most relevant branch.

  4. Return node context: The final result includes not only the answer but also the original node content and the retrieval path, so it can be verified.

This is entirely different from the black-box retrieval of "chunk and embed → rank by similarity." It behaves more like an expert assistant whose reasoning can be traced.
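Steps 3 and 4 can be sketched as a top-down walk over the tree. The sketch below is an assumption-laden illustration, not PageIndex's implementation: the node shape (`title`, `summary`, `nodes`), the function names, and the sample tree are hypothetical, and `keyword_chooser` is a simple word-overlap stand-in for the LLM call that would do the actual reasoning.

```python
from typing import Any, Callable, Dict, List, Tuple

Node = Dict[str, Any]  # hypothetical shape: {"title": str, "summary": str, "nodes": [Node, ...]}
Chooser = Callable[[str, List[Node]], int]


def tree_search(root: Node, question: str, choose: Chooser) -> Tuple[Node, List[str]]:
    """Walk from the root to a leaf, asking `choose` which child to follow at each level."""
    node, path = root, [root["title"]]
    while node.get("nodes"):
        children = node["nodes"]
        node = children[choose(question, children)]
        path.append(node["title"])
    return node, path


def keyword_chooser(question: str, children: List[Node]) -> int:
    """Stand-in for the LLM call: pick the child sharing the most words with the question."""
    q_words = set(question.lower().split())

    def overlap(child: Node) -> int:
        text = f"{child['title']} {child.get('summary', '')}".lower()
        return len(q_words & set(text.split()))

    return max(range(len(children)), key=lambda i: overlap(children[i]))


# Hypothetical tree, invented for this example.
tree: Node = {
    "title": "Annual Report", "summary": "", "nodes": [
        {"title": "Business Overview", "summary": "strategy and markets", "nodes": []},
        {"title": "Financial Statements", "summary": "income balance sheet and cash flow", "nodes": [
            {"title": "Income Statement", "summary": "revenue and profit", "nodes": []},
            {"title": "Cash Flow Statement",
             "summary": "operating investing and financing cash flow", "nodes": []},
        ]},
    ],
}

leaf, path = tree_search(tree, "operating cash flow change in 2023", keyword_chooser)
print(" -> ".join(path))
```

In a real system the chooser would send the question plus the child titles and summaries to an LLM and parse its pick. Keeping it as a pluggable function makes the search loop easy to test without a model, and the returned path is what gives the traceability described above.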

PageIndex vs. Traditional RAG: Comparison Table

Feature | PageIndex (reasoning-based) | Traditional RAG (vector retrieval)
Retrieval method | Tree structure + LLM reasoning | Vector similarity
Document processing | Preserves original structure, no chunking | Chunking, which breaks context
Traceability | Transparent path, nodes can be located | Black box, hard to trace back
Best suited for | Professional documents, long texts, high accuracy requirements | Massive data volumes, lightweight applications
Reported accuracy | 98.7% on FinanceBench | Generally reported as much lower

The trade-off: PageIndex is somewhat slower, since retrieval runs through LLM reasoning over the tree, but it is more accurate and easier to trust.

Quick Start

git clone https://github.com/VectifyAI/PageIndex.git

The output includes a table-of-contents tree along with structured information for each node. You can also enter a question directly on the command line to get an answer together with its retrieval path.
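Once a tree has been generated, a quick way to inspect it is to flatten it into an indented outline. The sketch below assumes a JSON shape with `title`, `page`, `summary`, and `nodes` keys. Those key names and the sample data are assumptions for illustration and may not match the tool's real output, so adjust them to whatever the generated file contains.

```python
import json
from typing import Any, Dict, List

# Hypothetical output; the keys ("title", "page", "summary", "nodes") are assumptions.
SAMPLE_JSON = """
{"title": "Annual Report", "page": 1, "summary": "Full-year report", "nodes": [
  {"title": "Financial Statements", "page": 40, "summary": "Audited statements", "nodes": [
    {"title": "Cash Flow Statement", "page": 45, "summary": "Cash movements", "nodes": []}
  ]}
]}
"""


def outline(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Flatten a nested tree into indented 'title (p. N)' lines, parents before children."""
    lines = [f"{'  ' * depth}{node['title']} (p. {node['page']})"]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines


tree = json.loads(SAMPLE_JSON)
lines = outline(tree)
print("\n".join(lines))
```

Reading the outline first is a cheap sanity check that the extracted hierarchy matches the document's real table of contents before any questions are asked against it.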

Use Cases

PageIndex is particularly well suited to the following scenarios:

  • Financial analysis: comparing data and applying logic across pages and tables.

  • Legal and compliance: pinpointing contract clauses and provisions in regulatory documents.

  • Scientific literature: survey papers and long reports, where chunking would lose context.

  • Technical manuals and instructions: clearly layered structure with frequent cross-chapter references.

In short, PageIndex has an edge on any document that is long, complex, and logically structured.

Conclusion and Outlook

RAG's bottlenecks are becoming more apparent, especially in enterprise scenarios, where "relevance" matters far more than "similarity." PageIndex points to a different path: making retrieval work more like reasoning than like search.

Its significance is the shift from having AI recite paragraphs to having it actually understand documents.

In the future, RAG may split into two camps:

  • The vector camp: optimizes for speed and low overhead, suited to large-scale, simple Q&A.

  • The reasoning camp: optimizes for accuracy and transparency, suited to high-value professional applications.

PageIndex is a representative of the reasoning camp.

For researchers, developers, and enterprise users, this open-source project is worth studying in depth. It may well become a standard part of next-generation RAG in the near future.

