AI Agents: Crossing the Optimization Gap from Demo to Production-Grade System
AI Summary (BLUF)
Building production-grade AI agents requires extensive optimization of both individual tools and end-to-end workflows, as accuracy compounds across multiple steps and small improvements in each component are critical for overall system reliability.
Introduction: Demos Are Easy, Production Is Hard
Creating a demo for Retrieval Augmented Generation (RAG) or agentic workflows is easy, but building a production-grade app is 10x harder, if not more. For every blog and tutorial claiming to get you started with RAG apps or agents in less than an hour, there are hundreds more talking about the complexity of building LLM, RAG, and AI systems that operate reliably at an acceptable accuracy and latency while staying within budget.
Bitter Lessons from the Front Lines
OpenAI's Path to RAG Optimization
We don’t have to go far to see examples of this. OpenAI, on their DevDay, no less, pointed towards the iterative nature of RAG optimization. While building a RAG pipeline for an enterprise client, OpenAI observed a baseline accuracy of 45%. The OpenAI engineers then tried multiple approaches to improve accuracy, including Hypothetical Document Embeddings, fine-tuning embeddings, and experimenting with different chunk sizes to better capture relevant information. After approximately 20 iterations, they had only reached 65% accuracy. At this point, they faced the decision to either abandon the project or continue optimizing. They chose to continue: applying cross-encoders to re-rank search results, using metadata to improve context relevance, doing further prompt engineering, integrating a tool (a SQL database), and using query expansion. Eventually, by trying all these methods and persisting with the ones that pushed accuracy upward, they managed to reach 98% accuracy.
There is a (bitter?) lesson here: great gains can be made simply by depending on search and learning.
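One technique that moved the needle for OpenAI, cross-encoder re-ranking, can be sketched as follows. This is a minimal illustration, not OpenAI's implementation: `overlap_score` is a toy stand-in for a real cross-encoder model, which would jointly score each (query, document) pair.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Re-rank first-stage retrieval results with a cross-encoder-style
    scorer that sees the query and each document together."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Toy stand-in for a real cross-encoder: score by term overlap.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["reset your password in settings",
        "pricing plans and billing",
        "how to reset a forgotten password"]
print(rerank("reset forgotten password", docs, overlap_score, top_k=2))
```

In production the scoring function is the expensive part, which is why cross-encoders are typically applied only to the top results of a cheaper first-stage retriever.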
NVIDIA's Finding: 15 Control Points in the RAG Pipeline
We have seen this situation play out elsewhere as well. Only the names change. Nvidia released a paper in July 2024 with the title “FACTS About Building Retrieval Augmented Generation-based Chatbots”. I will not bore you with the details, but Nvidia mentioned that they identified 15 different control points in a RAG pipeline and each one of these control points impacts the quality of the results generated. They found that, among other parameters, choosing the right query rewriting strategy, chunk size, pre-processing technique, metadata enrichment, reranking, and LLM all mattered to the final performance. The retrieval relevance determined the accuracy of the LLM response. And retrieval relevance itself was dependent on metadata enrichment, chunking and query rephrasal. Again, Nvidia used a grid-search based approach to identify the parameter settings that resulted in the highest RAG accuracy.
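A grid search like the one Nvidia describes can be sketched in a few lines. Everything here is illustrative: `evaluate_rag` is a hypothetical surrogate for running the real pipeline against a labeled evaluation set, and the score increments are made up for the sake of a runnable example.

```python
from itertools import product

# Hypothetical evaluation: in practice this would run the full RAG
# pipeline against a labeled eval set and return answer accuracy.
def evaluate_rag(chunk_size: int, rerank: bool, query_rewrite: str) -> float:
    score = 0.45  # toy baseline, echoing the 45% starting point above
    score += {256: 0.05, 512: 0.10, 1024: 0.02}[chunk_size]
    score += 0.15 if rerank else 0.0
    score += {"none": 0.0, "hyde": 0.08, "expansion": 0.05}[query_rewrite]
    return score

grid = {
    "chunk_size": [256, 512, 1024],
    "rerank": [False, True],
    "query_rewrite": ["none", "hyde", "expansion"],
}

# Exhaustively evaluate every combination and keep the best one.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda cfg: evaluate_rag(**cfg),
)
print(best)
```

With 15 control points a full grid becomes combinatorially expensive, which is why coordinate-wise or Bayesian search is often substituted in practice; the structure of the loop stays the same.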
Agentic Workflows: A Steeper Challenge
You might think that leveraging search to find optimal parameters is limited to RAG and does not apply to agentic workflows. In fact, with agents, the gap between a hacked-together demo and a reliable production-grade system is even wider. As Richard Socher pointed out, if each step of an AI agent is 95% accurate, none of the 30-step workflows will work. Going from 95% to 99.9% is a last-mile problem similar to self-driving cars.
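The arithmetic behind Socher's point is simple multiplicative compounding, easy to check:

```python
# Per-step accuracy compounds multiplicatively across a workflow:
# a 30-step agent at 95% per step succeeds only ~21% of the time.
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    return per_step ** steps

print(f"{end_to_end_accuracy(0.95, 30):.2f}")   # ~0.21
print(f"{end_to_end_accuracy(0.999, 30):.2f}")  # ~0.97
```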
Agents vs. RAG: Core Similarities and Key Differences
There has already been quite a bit written about agents. For our purpose, an agent is a model (e.g. LLM) that is able to call external tools. Multi-agents are multiple agents with access to multiple external tools (which may or may not be shared across agents). Given a user prompt, an agent uses an LLM to take actions such as calling a tool, and collects observations in a loop until a goal is met.
To accurately give a response to a user prompt, the agent must be able to do two things:
- Retrieve precise and accurate context needed to answer the user’s query (retrieval)
- Generate the correct response given that context (generation)
If we look at the above two requirements, an agent does not look that different from Retrieval Augmented Generation. What is different here is that the context may be retrieved with the help of multiple tools instead of just a vector database. Also, the retrieval and generation may run in a loop before returning output to the user.
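The loop just described can be sketched as follows. This is a minimal illustration under assumed interfaces: `FakeLLM` and the `tools` dict are hypothetical stand-ins, not a real agent framework or model API.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                          # "tool_call" or "final_answer"
    tool: str = ""
    args: dict = field(default_factory=dict)
    content: str = ""

class FakeLLM:
    """Scripted stand-in for a real model: look something up, then
    answer from the collected observation."""
    def decide(self, prompt: str, observations: list) -> Action:
        if not observations:
            return Action(kind="tool_call", tool="search", args={"q": prompt})
        return Action(kind="final_answer", content=f"Based on: {observations[-1]}")

def run_agent(llm, tools: dict, user_prompt: str, max_steps: int = 10) -> str:
    observations = []
    for _ in range(max_steps):
        # The model picks the next action from the prompt plus observations.
        action = llm.decide(user_prompt, observations)
        if action.kind == "final_answer":
            return action.content                            # generation
        observations.append(tools[action.tool](**action.args))  # retrieval via a tool
    raise RuntimeError("goal not met within step budget")

tools = {"search": lambda q: f"top hit for '{q}'"}
print(run_agent(FakeLLM(), tools, "agent reliability"))
```

Note how retrieval (the tool call) and generation (the final answer) alternate inside one loop, which is exactly why errors at either stage compound.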
The Path to an Optimal Agent: Dual Optimization
Test-Time Compute and Model Evolution
As models progressively get better, e.g. by using test-time compute, the hyperparameter optimization load on the generation step should shrink. Some of the search will already have been done at test time by models like o1, meaning that the search space over which the generation step's hyperparameters must be tuned is reduced.
The availability of test-time-compute-optimal models also makes agents more robust, because the generation model has been trained not to traverse unpromising paths. So it has a higher probability of maintaining the correct trajectory towards the optimal response.
Tool Optimization and End-to-End Optimization
The question still is: how will we get to an optimal agent? I expect that, just as Nvidia and OpenAI had to optimize a tool (retrieval) as part of optimizing the full RAG system, agent optimization will require tool optimization as part of full agent optimization. And both tool and end-to-end optimization will require search and learning. The less certain you are about the optimality of your tool's responses, the more uncertain you will be about how to perform the end-to-end agent optimization.
Why is it important to perform both tool and end-to-end agent optimization? Every step in an agent can be a single point of failure. Increasing the reliability of each step ensures that we traverse the right path to the goal instead of rolling down the wrong hill. In other words, the error multiplies with each step, so we want the error per step to be as low as possible. And the longer the agent's loop runs before the goal is met, the greater the chance of going down a wrong path.
So it will pay to optimize the hyperparameters of the tool as much as you can, while still performing end-to-end agent optimization.
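The error-multiplication point above can be made concrete by inverting it: given a target end-to-end accuracy, how accurate must each step be? The per-step error budget turns out to be tiny.

```python
# Required per-step accuracy to reach an end-to-end target over n steps,
# assuming independent steps whose accuracies multiply.
def required_per_step(target: float, steps: int) -> float:
    return target ** (1 / steps)

# To hit 99% end-to-end over a 30-step workflow, each step must be
# roughly 99.97% accurate.
print(f"{required_per_step(0.99, 30):.4f}")
```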
Key Optimization Dimensions Compared
Based on OpenAI's and NVIDIA's experience, the key optimization dimensions affecting RAG and agent performance can be summarized as follows:
| Optimization dimension | Techniques / parameters | Main impact on performance | Example optimization methods |
|---|---|---|---|
| Retrieval relevance | Chunking strategy, metadata enrichment, query rephrasal | Sets the accuracy floor for the LLM response | Grid search; A/B test chunk sizes and query strategies |
| Result ranking | Cross-encoder re-ranking, relevance scoring | Improves the quality of context passed to the LLM | Integrate a re-ranking model to refine first-stage retrieval results |
| Generation quality | LLM choice, prompt engineering, test-time compute | Directly affects accuracy and reliability of the final output | Iterate on prompts; use reasoning-capable models (e.g. o1) |
| Tool calling (agents) | Tool-selection logic, argument passing, error handling | Drives success rate and error compounding in multi-step workflows | Benchmark and optimize each tool independently |
| End-to-end pipeline | Loop control, hyperparameter combinations, failure recovery | Determines overall system effectiveness on complex tasks | Full-pipeline testing and tuning in realistic scenarios |
Frequently Asked Questions (FAQ)
Why is it so hard to take an AI agent from demo to production-grade system?
A demo only needs basic functionality, while a production-grade system must run reliably within accuracy, latency, and budget constraints. That requires extensive optimization of every component and of the end-to-end workflow, where small improvements are critical to overall reliability.
What challenges did OpenAI face when optimizing RAG?
While building a RAG pipeline for an enterprise client, OpenAI started from a baseline accuracy of only 45% and reached just 65% after roughly 20 iterations. They ultimately pushed accuracy to 98% with methods such as cross-encoder re-ranking and metadata optimization.
Why are agentic workflows more challenging than RAG?
Agentic workflows involve many steps: even at 95% per-step accuracy, a 30-step workflow will likely fail. Per-step accuracy needs to reach 99.9% or better, a last-mile problem similar to self-driving cars.