
How Can a RAG System Optimize Document Processing and Vector Retrieval? (Hands-On with IBM Docling and Re-ranking Models)

2026/4/1
AI Summary (BLUF)

This technical guide explores advanced optimization techniques for RAG (Retrieval-Augmented Generation) systems, focusing on document processing with IBM's Docling, efficient vector similarity calculations using dot product over cosine similarity, and implementing re-ranking models to improve retrieval accuracy. The article demonstrates practical implementation with code examples and discusses transitioning to enterprise-scale solutions like Vertex AI's RAG Engine.


Now that we've established the basics in our "Crawl" phase, it's time to pick up the pace. In this guide, we'll move beyond the initial setup to focus on optimizing core architectural components for better performance and accuracy.


Architecture Evolution: From "Crawl" to "Walk"

We ended the previous "Crawl" phase with a functioning AI HR agent backed by a RAG system. The responses, however, could be better. I've introduced some new elements to the architecture for better document processing and chunking, as well as a re-ranker model that sorts the semantic retrieval results by relevance:


Diagram: the optimized RAG architecture

A Powerful Document Processing Tool: Docling

IBM's Docling is an open-source document processing tool and easily one of the most effective ones I've tested. It can convert various file formats (e.g., PDF, docx, HTML) into clean, structured formats like Markdown and JSON. By integrating AI models and OCR, it doesn't just extract text, but also preserves the original layout's integrity.
Through its hierarchical and hybrid chunking methods, Docling intelligently groups content by heading, merges smaller fragments for better context, and attaches rich metadata to streamline downstream searching and citations.


Here's an example Python function that chunks a PDF file:

from docling.chunking import HybridChunker
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# PDF pipeline configuration (OCR, table structure, etc.).
pipeline_options = PdfPipelineOptions()


def docling_chunk_pdf(file: str) -> tuple[list, list]:
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    # Convert the PDF into Docling's structured document representation.
    result = converter.convert(file)
    doc = result.document

    # Hybrid chunking: groups content by heading and merges small fragments.
    chunker = HybridChunker()
    chunks = list(chunker.chunk(doc))
    chunk_texts = [c.text for c in chunks]

    return chunks, chunk_texts

I plan to take a deeper dive into Docling in a future article, so give me a follow so you won't miss it! 😄


Vector Retrieval: Dot Product vs. Cosine Similarity

In the "Crawl" post, I talked briefly about cosine similarity and how it ignores magnitude and only focuses on the angle between two vectors. This is because normalization is baked into the cosine similarity formula.
Dot product is effectively cosine similarity but without the final normalization step, which is why its result is affected by the magnitude of the vectors. Since many modern embedding models output pre-normalized unit vectors, the extra normalization step in cosine similarity becomes a redundant calculation. By using dot product on these pre-normalized vectors, you can achieve identical results with higher computational efficiency.

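This equivalence is easy to sanity-check with a few lines of NumPy. Not from the original article: the random vectors below are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)


def cosine(u, v):
    # Cosine similarity: a dot product with normalization baked in.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Normalize to unit length, as many embedding models already do on output.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# On unit vectors, the plain dot product equals cosine similarity.
assert np.isclose(np.dot(a_unit, b_unit), cosine(a, b))
```

The dot product skips the two norm computations and the division per comparison, which is where the efficiency gain comes from.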

Key Considerations

NOTE #1: While switching to dot product can increase your raw retrieval throughput, the latency gains may feel negligible in the context of the entire end-to-end RAG pipeline, depending on your particular use case and scale.


NOTE #2: A friendly reminder that choosing dot product over cosine similarity has the hard requirement that your vectors be normalized beforehand, or the magnitude will skew your search results. It's also quite easy to update your search configuration to use one or the other. If you're ever in doubt, just run a quick test with both settings to verify that both methods return the exact same nearest neighbours (top semantic matches).

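The quick test NOTE #2 suggests might look like the following sketch, which uses a random toy corpus in place of real embeddings and brute-force search in place of a vector database:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy corpus: 100 pre-normalized 384-dim embeddings plus one unit-length query.
docs = rng.standard_normal((100, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.standard_normal(384)
query /= np.linalg.norm(query)

k = 5

# Ranking by dot product (the vectors are already unit length).
dot_top = np.argsort(docs @ query)[::-1][:k]

# Ranking by cosine similarity (explicit normalization, redundant here).
cos_scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
cos_top = np.argsort(cos_scores)[::-1][:k]

# On pre-normalized vectors, both metrics return the same nearest neighbours.
assert np.array_equal(dot_top, cos_top)
```

If the two top-k lists ever diverge on your real data, that's a strong sign your stored vectors are not actually normalized.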

Improving Precision: Re-ranking

Standard search is built for speed, not deep understanding, so it can sometimes miss nuances. Re-ranking takes a crucial "second look" at the standard retrieval results to determine which one(s) actually address the user's query. While cosine distance measures how closely the query and document align in the vector space, a "close" match doesn't guarantee an answer. The re-ranker's job is to bridge this gap by scrutinizing the top results, assigning a true relevance_score, and ensuring the most helpful contexts rise to the top.


Here's a code snippet that performs re-ranking using the Cohere API:

import cohere

# Assumes RERANKING_MODEL, user_query, documents_to_rerank, and
# candidate_responses are defined earlier in the pipeline.
co = cohere.ClientV2()

response = co.rerank(
    model=RERANKING_MODEL,
    query=user_query,
    documents=documents_to_rerank,
    top_n=len(candidate_responses),
)

# Map each re-ranked result back to its original chunk metadata.
reranked_results = []
for res in response.results:
    original_data = candidate_responses[res.index]

    reranked_results.append({
        "content": original_data["content"],
        "source": original_data["source"],
        "heading": original_data["heading"],
        "page": original_data["page"],
        "search_distance": original_data["search_distance"],
        "relevance_score": res.relevance_score,
    })

As part of the full code that performs the re-ranking, I assign a threshold for the relevance score. Scores lower than this threshold are deemed irrelevant.

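The full thresholding code isn't shown in the article, but the step can be sketched in a few lines; the cutoff value and the sample records below are hypothetical:

```python
RELEVANCE_THRESHOLD = 0.5  # hypothetical cutoff; tune for your model and corpus

# Sample re-ranked results (illustrative data only).
reranked_results = [
    {"content": "PTO accrual policy...", "relevance_score": 0.91},
    {"content": "Office parking map...", "relevance_score": 0.08},
]

# Keep only the chunks the re-ranker scored as relevant.
relevant_context = [
    r for r in reranked_results if r["relevance_score"] >= RELEVANCE_THRESHOLD
]
```

Chunks that survive the filter are what get passed to the LLM as context; everything below the threshold is dropped as noise.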

Stack Updates for the "Walk" Phase

I'm shaking up the stack for the "Walk" phase! In addition to using a different document processor, I will also be using a different embedding model and vector database.
Since I wanted to try out Cohere's re-ranking model, I opted to lean into their full suite and use their embedding model as well. I made a deliberate choice here to set the embedding dimension to 384, which is lower than the 768 I previously used in the "Crawl" example. I wanted to handicap the initial semantic search, and by doing so, we can more clearly see the re-ranker work its magic to fix the order of the results.
I swapped ChromaDB for LanceDB to showcase just how many robust, easy-to-use open-source local vector databases are available.


Querying the HR Assistant

While I kept the core agent configuration from the "Crawl" phase the same, the addition of the re-ranking step made a significant impact. I asked the same two benchmark questions and this time the results were more refined and accurate:


Example responses from the optimized HR RAG assistant

You can find the code for the "Walk" phase → here


Next Steps: Scaling Up

Now that we've manually optimized our retrieval and re-ranking, the next step is to scale. I will be migrating this architecture to Vertex AI's RAG Engine for a fully managed, high-performance RAG pipeline at an enterprise scale.


Further Learning

I used Cohere's embedding and re-ranking models in my example, but if you want to try out Vertex AI's re-ranking capabilities (and more), try out this Advanced RAG Techniques Codelab.


FAQ

Why is dot product recommended over cosine similarity for vector retrieval in RAG systems?

Because modern embedding models typically output pre-normalized unit vectors, dot product avoids the redundant normalization step baked into cosine similarity, returning identical results with better computational efficiency.

What advantages does IBM's Docling offer for document processing?

Docling converts PDF, docx, and other formats into structured data, preserves layout integrity through intelligent chunking, and attaches rich metadata that streamlines downstream retrieval and citation.

How do re-ranking models improve retrieval accuracy in a RAG system?

A re-ranking model takes a second pass over the initial semantic retrieval results and reorders them by relevance, filtering out noise and improving the precision of the final answers.
