How to Optimize a RAG System? Enterprise Lessons from the Field: Query Generation and Reranking Strategies

2026/2/16
AI Summary (BLUF)

After 8 months building RAG systems for two enterprises (9M and 4M pages), we share what actually worked versus what wasted time. The highest-ROI optimizations were query generation, reranking, chunking strategy, metadata injection, and query routing.

Introduction

Over the past eight months, my team and I have been deeply immersed in building and refining Retrieval-Augmented Generation (RAG) systems for production. We implemented RAG for two distinct use cases: Usul AI, which processes 9 million pages, and a confidential enterprise legal AI application handling 4 million pages. This journey took us from initial optimism with quick prototypes to the sobering reality of subpar performance that only end-users could detect, ultimately leading to months of iterative improvements. This post distills our key learnings, ranking them by their impact on the final system's performance.

What Actually Moved the Needle: High-ROI Improvements

Our initial prototype, built swiftly with popular frameworks, failed to meet real-world user expectations. The following changes, listed in order of their return on investment (ROI), were crucial in bridging that gap.

Query Generation

We learned that a user's final query often fails to capture the full semantic context needed for retrieval. To address this, we implemented a step where an LLM reviews the conversation thread and generates multiple queries—combining both semantic and keyword-based approaches. These queries are processed in parallel, and their results are passed to a reranker. This strategy significantly expanded our retrieval coverage and reduced over-reliance on the computed scores from any single hybrid search.
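
As a rough sketch of this step (not our production code), the snippet below asks an LLM to expand the conversation into several queries and fans the searches out in parallel. The `hybrid_search` function, the prompt wording, and the model name are illustrative assumptions.

```python
# Sketch of LLM-based query generation with parallel retrieval.
# Assumes an OpenAI-compatible client; `hybrid_search` is a hypothetical
# stand-in for your vector DB's hybrid (semantic + keyword) search.
import json
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()

def generate_queries(thread: list[dict], n: int = 4) -> list[str]:
    """Rewrite the conversation into n search queries, mixing
    natural-language (semantic) and short keyword-style queries."""
    prompt = (
        f"Given the conversation below, write {n} search queries that "
        "together cover the user's information need. Mix natural-language "
        "queries with short keyword queries. Return a JSON array of strings.\n\n"
        f"Conversation:\n{json.dumps(thread, ensure_ascii=False)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch only: assumes the model returns a bare JSON array;
    # production code should validate or use structured outputs.
    return json.loads(resp.choices[0].message.content)

def retrieve_for_queries(queries: list[str], k: int = 50) -> list[dict]:
    """Run hybrid search for every query in parallel, then merge and
    deduplicate by chunk id before handing the pool to the reranker."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda q: hybrid_search(q, top_k=k), queries))
    seen, merged = set(), []
    for chunk in (c for results in result_lists for c in results):
        if chunk["id"] not in seen:
            seen.add(chunk["id"])
            merged.append(chunk)
    return merged
```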

Reranking

Integrating a reranker was arguably the highest-value addition to our codebase—a minimal change with an outsized impact. We observed dramatic shifts in chunk rankings, far beyond our initial expectations. A reranker can often compensate for suboptimal retrieval setups, provided you feed it a sufficient number of initial chunks. Through experimentation, we found an optimal configuration of passing 50 chunks to the reranker and receiving 15 as output.
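
A minimal sketch of that 50-in/15-out configuration, using Cohere's rerank endpoint (one of the rerankers we tried); the chunk field names and API key handling are assumptions.

```python
# Minimal reranking sketch: feed 50 candidate chunks to the reranker,
# keep the top 15. Field names are illustrative.
import cohere

co = cohere.Client("YOUR_API_KEY")  # assumption: key supplied via env/config

def rerank_chunks(query: str, chunks: list[dict], top_n: int = 15) -> list[dict]:
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c["text"] for c in chunks[:50]],
        top_n=top_n,
    )
    # Map the reranked indices back onto the original chunk objects.
    return [chunks[r.index] for r in response.results]
```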

Chunking Strategy

This aspect demands substantial effort and will likely consume most of your development time. We built custom chunking flows for both enterprises. The key is to deeply understand your data: manually review sample chunks, and ensure that a) chunks are not truncated mid-word or mid-sentence, and b) each chunk forms a logical, self-contained unit of information.
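
To make those two checks concrete, here is a toy sentence-aware chunker. The size limit and splitting regex are illustrative assumptions, and a real flow tailored to your corpus will be considerably more involved.

```python
# Toy chunker reflecting the two checks above: never cut mid-sentence,
# and greedily pack whole sentences into self-contained chunks.
# The 1500-character limit is an assumption, not a value from this post.
import re

def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    # Split on whitespace that follows sentence-ending punctuation
    # (Latin and fullwidth); a sketch for whitespace-delimited text.
    sentences = re.split(r"(?<=[.!?。!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```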

Injecting Metadata into the LLM Context

Initially, we passed only the raw chunk text to the LLM. An experiment revealed that injecting relevant metadata (such as title, author, source) alongside the text substantially improved the quality of the context and the final answers generated by the LLM.
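
A sketch of what this injection can look like; the metadata fields and the separator are illustrative, not our exact format.

```python
# Prepend each chunk's metadata to its text before it enters the LLM
# context window. Field names are illustrative assumptions.
def format_chunk(chunk: dict) -> str:
    header = (
        f"[Title: {chunk.get('title', 'unknown')} | "
        f"Author: {chunk.get('author', 'unknown')} | "
        f"Source: {chunk.get('source', 'unknown')}]"
    )
    return f"{header}\n{chunk['text']}"

def build_context(chunks: list[dict]) -> str:
    """Join the formatted chunks into one context block for the prompt."""
    return "\n\n---\n\n".join(format_chunk(c) for c in chunks)
```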

Query Routing

We encountered many user questions that fell outside RAG's core competency (e.g., "summarize this article," "who wrote this?"). To handle these efficiently, we built a lightweight router that detects such intents and redirects them to a simpler pipeline—typically an API call combined with an LLM—bypassing the full RAG retrieval setup entirely.
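
The router can be as small as one cheap classification call. In the sketch below, the intent labels and the two downstream handlers are hypothetical placeholders, not our actual route set.

```python
# Lightweight intent router: classify the question first, and only run
# the full RAG pipeline when retrieval is actually needed.
from openai import OpenAI

client = OpenAI()
LABELS = {"rag", "summarize", "document_metadata"}

def route(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": (
                "Classify the question as one of: rag, summarize, "
                f"document_metadata. Reply with the label only.\n\n{question}"
            ),
        }],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in LABELS else "rag"  # default to full RAG

def answer(question: str) -> str:
    if route(question) == "rag":
        return run_rag_pipeline(question)    # hypothetical: full retrieval + generation
    return run_simple_pipeline(question)     # hypothetical: direct API call + LLM
```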

Our Evolving Technical Stack

Our infrastructure choices evolved based on performance, cost, and specific feature needs; a short configuration sketch follows the list below.

  • Vector Database: Azure Cognitive Search → Pinecone → Turbopuffer (cost-effective, with native keyword search support)
  • Document Extraction: custom solution
  • Chunking: Unstructured.io by default; custom flows for both enterprises (note: we have heard positive feedback about Chonkie)
  • Embedding Model: text-embedding-3-large (we did not extensively test alternatives)
  • Reranker: none → Cohere 3.5 → Zerank (lesser known, but it performed well)
  • LLM: GPT-4.1 → GPT-5 → GPT-4.1 (leveraging Azure credits)
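
One way to pin these moving parts down is a single configuration object. The sketch below is purely illustrative and not taken from the agentset codebase; it simply restates the final stack and the reranking numbers above.

```python
# Illustrative config capturing the final stack and the 50-in / 15-out
# reranking setup described in this post; names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    vector_db: str = "turbopuffer"
    embedding_model: str = "text-embedding-3-large"
    reranker: str = "zerank"
    llm: str = "gpt-4.1"
    rerank_in: int = 50   # chunks fed to the reranker
    rerank_out: int = 15  # chunks kept for the LLM context
```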

Conclusion & Open-Source Contribution

The path from a working RAG prototype to a production-grade system is paved with iterative refinement, focused on foundational elements like query understanding, result reranking, and intelligent data chunking. To encapsulate these learnings and give back to the community, we have consolidated our approach into an open-source project: agentset-ai/agentset, released under the permissive MIT license. We welcome questions, feedback, and contributions.
