
How Does RAG Go Beyond Document Retrieval and Evolve into a Dynamic Optimization Engine?

2026/4/23

AI Summary (BLUF)

This article presents a RAG framework that goes beyond traditional document retrieval, evolving it into a dynamic optimization engine. The framework dynamically retrieves and composes multiple kinds of context (such as tailored instructions and few-shot examples) and refines its retrieval strategy in real time based on interaction feedback, prioritizing high-utility content. The system forms a closed learning loop that automatically corrects errors and reinforces successful outputs, letting the LLM improve continuously without retraining, shifting from static retrieval to real-time adaptive optimization.

Large Language Models (LLMs) generate useful responses across a wide range of queries but often lack the specialized knowledge and skills needed for niche tasks. They do not have direct access to private or recently updated data that falls outside their pre-training corpus. Additionally, they can struggle with complex tasks, such as generating valid code or following precise domain-specific guidelines.

One way to address these limitations is by augmenting the LLM’s prompt with additional context. This is commonly achieved using Retrieval Augmented Generation (RAG), where relevant information is retrieved from a database and added to the prompt at runtime, grounding the model’s response in external knowledge.

We introduce a new framework that generalizes RAG, transforming it from a retrieval system into a dynamic optimization process. By systematically refining how context is retrieved and composed, this framework enables LLMs to improve continuously, optimizing their outputs in real time without retraining.

Generalizing RAG

Retrieval Augmented Generation (RAG) enhances an LLM’s prompt by dynamically incorporating relevant external information. The most well-known implementation of this is Document-RAG, where document chunks are retrieved from a knowledge base based on embedding similarity and inserted into a system message to guide the model’s response.
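
As a concrete reference point, Document-RAG can be sketched in a few lines. The toy embeddings, chunk names, and helper functions below are illustrative assumptions, not a real system:

```python
import math

# Hypothetical toy embeddings; in practice these come from an embedding model.
DOC_CHUNKS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, k=1):
    """Return the k chunk names most similar to the query embedding."""
    ranked = sorted(DOC_CHUNKS.items(),
                    key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

def compose(query, chunks):
    """Insert retrieved chunks into a system message ahead of the query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"System: Use the following context:\n{context}\nUser: {query}"
```

The rest of the article keeps this retrieve-then-compose shape but swaps out what is retrieved and how candidates are ranked.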

While Document-RAG effectively integrates external knowledge, retrieval is not limited to documents. The same retrieval principles apply to other types of context—such as tailored instructions or few-shot examples—that refine an LLM’s ability to generate accurate and context-aware responses. To generalize RAG, it is essential to break down its core components and their roles in the system.

At a high level, a RAG system retrieves and composes context dynamically to construct an optimized prompt before passing it to the LLM. The key components of this process include:

  • Query: The input the LLM responds to, typically a user message.
  • Prompt: The full input text provided to the LLM, including the query and additional context such as a system message, chat history, or few-shot examples.
  • Response: The output generated by the LLM based on the prompt.
  • Context: Any additional information retrieved to improve response quality. This includes document chunks, tailored instructions, and relevant few-shot examples.
  • Retriever: A function that selects the most useful contexts based on the query.
  • Composer: A function that integrates the retrieved contexts into a structured prompt to maximize the LLM’s effectiveness.
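
The components above can be wired together as a minimal pipeline. Everything here (the class name, the lambda stand-ins) is an illustrative sketch, not a real API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    """Minimal sketch of the generalized RAG loop: retrieve, compose, generate."""
    retriever: Callable[[str], List[str]]        # query -> contexts
    composer: Callable[[str, List[str]], str]    # query + contexts -> prompt
    llm: Callable[[str], str]                    # prompt -> response

    def respond(self, query: str) -> str:
        contexts = self.retriever(query)         # any context type: documents,
        prompt = self.composer(query, contexts)  # instructions, examples
        return self.llm(prompt)

# Toy stand-ins for illustration only.
pipeline = RAGPipeline(
    retriever=lambda q: ["Always answer in one sentence."],
    composer=lambda q, cs: "\n".join(cs) + "\nUser: " + q,
    llm=lambda p: f"(response to prompt of {len(p)} chars)",
)
```

Generalizing RAG amounts to swapping in different `retriever` and `composer` implementations while the loop itself stays fixed.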

Each of these components serves a distinct function. The retriever identifies relevant information that enhances the model’s response, while the composer determines how that information is framed within the prompt for maximum utility. Together, they dictate the effectiveness of RAG in guiding LLM behavior.

By treating retrieval as a method for dynamically constructing prompts, rather than simply supplementing an LLM with external facts, RAG becomes a flexible optimization mechanism. Retrieving and composing any context that enhances response quality—whether factual data, domain-specific guidelines, or curated examples—extends RAG beyond static document retrieval, making it a more precise and adaptable tool for improving LLM outputs.

In-Context Learning

LLMs generate responses based on the information in their prompts, and some prompts are more effective than others. The goal is to construct prompts that maximize response accuracy by providing the most useful supporting context.

Retrieval plays a key role in optimizing prompts. Document-RAG enhances response quality by grounding the model in factual knowledge, but retrieval need not be limited to documents. Other forms of context—such as instructions and few-shot examples—also improve LLM outputs by shaping how the model processes and generates responses.

However, instructions and few-shot examples are often static, meaning the same predefined context is used for every query. This rigid approach fails to account for variation across prompts. Different queries require different supporting contexts to produce the best responses.

RAG removes this limitation by making context dynamic. Instead of relying on fixed examples or generic instructions, retrieval selects the most useful guidance for each query, ensuring the LLM is always working with the most effective context available.
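
One way to picture per-query context selection: a pool mixing instructions and few-shot examples, filtered for each query. The keyword matching below is a deliberately crude stand-in for real similarity or utility scoring, and all names are hypothetical:

```python
# Illustrative context pool mixing context types; the topic tags are hypothetical.
CONTEXT_POOL = [
    {"type": "instruction", "text": "Reply with valid JSON only.", "topic": "api"},
    {"type": "few_shot", "text": "Q: 2+2? A: 4", "topic": "math"},
    {"type": "instruction", "text": "Cite the refund policy section.", "topic": "billing"},
]

def select_context(query, pool):
    """Pick contexts whose topic keyword appears in the query.

    A stand-in for real retrieval: each query gets its own supporting
    context instead of one static, predefined set.
    """
    return [c["text"] for c in pool if c["topic"] in query.lower()]

prompt_contexts = select_context("How do I call the api?", CONTEXT_POOL)
```

The point is that two different queries pull two different instruction sets from the same pool, which a static prompt cannot do.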

Diagram illustrating dynamic context selection in RAG

Prompt Tuning

Retrieval determines what information is available to an LLM, but not all retrieved context contributes equally to response quality. Traditional retrieval methods focus on similarity to the query, but similarity alone does not guarantee an improved response. What matters is utility—how much a retrieved context enhances the likelihood of a correct answer.

Critically, this utility can be measured. Given a prompt, a correct response, and a piece of context, we can directly evaluate its impact on response accuracy. Since this effect is quantifiable, retrieval can be optimized to prioritize high-utility contexts—those that provide the most significant improvement—rather than simply retrieving the most related ones.

A well-tuned retriever aligns retrieval with measured utility. The simplest approach is to bias retrieval toward contexts with high measured utility. A stronger approach is to learn a mapping from prompts and contexts into a space where similarity tracks utility. Two techniques are common: training a model, typically a small neural network, that maps pre-trained embeddings into such a space, or using an LLM to rewrite prompts and contexts so that their similarity reflects utility.
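
A minimal sketch of this measurement, assuming a black-box `score_fn` that returns something like the LLM's probability of the correct answer given a prompt. The toy scorer at the end is purely illustrative:

```python
def utility(score_fn, prompt, answer, context):
    """Utility of a context = how much it raises the chance of the correct
    answer. score_fn(prompt, answer) is an assumed black box, e.g. the
    LLM's log-probability of `answer` given `prompt`."""
    with_ctx = score_fn(context + "\n" + prompt, answer)
    without_ctx = score_fn(prompt, answer)
    return with_ctx - without_ctx

def rank_by_utility(score_fn, prompt, answer, contexts):
    """Order candidate contexts by measured utility, highest first."""
    return sorted(contexts,
                  key=lambda c: utility(score_fn, prompt, answer, c),
                  reverse=True)

# Toy scorer for illustration: rewards prompts that mention the answer.
toy_score = lambda prompt, answer: float(answer in prompt)
```

With a real scorer, the same ranking replaces similarity-only retrieval: the context that most improves the answer wins, even if it is not the most similar to the query.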

By shifting retrieval from similarity-based selection to utility-driven optimization, the process becomes more than just retrieving information—it acts as a fine-tuning mechanism that continuously improves model performance without modifying its weights. This allows LLMs to adapt in real time, making them more effective for specialized tasks and evolving information needs.

Diagram illustrating utility-driven retrieval optimization

Closing the Loop

Each time an LLM generates a response, the system stores an interaction, which consists of the query, the model’s response, and the resulting outcome—an observable effect that provides feedback on performance. Outcomes take various forms: a tool’s output, a follow-up message, an explicit correction, or a downstream KPI such as engagement or resolution rate.

All interactions, both positive and negative, are stored and used by the system to adjust its retrieval strategy and create high-utility contexts in real time.
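
A stored interaction and its log might be sketched as follows; the field names and reward convention are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Interaction:
    """One logged exchange; field names are illustrative."""
    query: str
    response: str
    outcome: str                     # e.g. tool output, follow-up, correction
    reward: Optional[float] = None   # +1 success, -1 failure, None = unknown

class InteractionStore:
    """Append-only log the system mines to tune retrieval."""
    def __init__(self):
        self._log: List[Interaction] = []

    def record(self, interaction: Interaction) -> None:
        self._log.append(interaction)

    def with_outcome(self, positive: bool) -> List[Interaction]:
        """Filter logged interactions by the sign of their reward."""
        sign = 1 if positive else -1
        return [i for i in self._log
                if i.reward is not None and i.reward * sign > 0]
```

The two filtered views feed the two mechanisms described next: negative outcomes drive corrective instructions, positive outcomes become few-shot candidates.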

Correcting Negative Outcomes

When an interaction results in an undesirable outcome—such as an incorrect response, hallucination, or ambiguity—the system generates a corrective instruction to guide future responses.

Rather than relying on manually crafted instructions, rejection sampling constructs them automatically. An LLM generates multiple variations of a corrective instruction, evaluates their impact, and selects the most effective one. This is done by looking through the stored interactions and measuring how much the instruction, when used as context, would have increased the chance of generating responses with a positive outcome while reducing the chance of generating responses with a negative outcome.
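
This selection step can be sketched as follows, assuming a `measure` function that estimates how much an instruction would have helped a past interaction. The toy measure at the end is purely illustrative:

```python
def best_corrective_instruction(candidates, interactions, measure):
    """Rejection sampling sketch: score each candidate instruction by how
    much it would have improved past interactions, and keep the best.

    measure(instruction, interaction) is an assumed utility estimate,
    positive when the instruction would have helped.
    """
    def total_gain(instruction):
        return sum(measure(instruction, i) for i in interactions)
    return max(candidates, key=total_gain)

# Toy measure for illustration: an instruction "helps" a failed interaction
# if it mentions the failure mode recorded in the outcome.
toy_measure = lambda instr, inter: 1.0 if inter["outcome"] in instr else 0.0
```

In a real system `measure` would itself call the scoring machinery from the prompt-tuning step, replaying stored interactions with and without the candidate instruction.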

These corrective instructions are stored in the retrieval database and dynamically injected into prompts for relevant queries, allowing the system to self-correct without modifying the model’s weights.

Reinforcing Positive Outcomes

When an interaction produces a strong response, it is stored as a candidate for few-shot examples, reinforcing effective behavior.

Beyond the training phase, the retriever is continuously tuned on the fly to prioritize high-utility examples—examples that consistently improve response quality. As the dataset evolves, examples with broader utility become preferred, while those with diminishing impact are deprioritized. This ensures that few-shot learning remains adaptive, always retrieving the most effective examples.
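
One plausible way to keep example utilities current is an exponential moving average over observed per-use gains, so examples whose impact fades are naturally deprioritized. This is a sketch under that assumption, not the article's stated mechanism:

```python
class ExampleBank:
    """Few-shot example bank with running utility estimates.

    Utility is updated as an exponential moving average of observed
    per-use gains (an assumed feedback signal)."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha      # how quickly recent observations dominate
        self.utilities = {}     # example text -> running utility

    def update(self, example: str, observed_gain: float) -> None:
        """Blend a new observed gain into the running estimate."""
        old = self.utilities.get(example, 0.0)
        self.utilities[example] = (1 - self.alpha) * old + self.alpha * observed_gain

    def top_k(self, k: int):
        """Return the k examples with the highest running utility."""
        ranked = sorted(self.utilities, key=self.utilities.get, reverse=True)
        return ranked[:k]
```

Because the estimate decays toward recent observations, an example that stops helping slides down the ranking without any manual curation.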

In cases where explicit feedback is unavailable, the system assigns an implicit reward using long-term credit assignment—tracing system-wide KPIs back to individual interactions. For example, an LLM deployed on a website can use engagement metrics, resolution rates, or user retention as feedback signals, refining retrieval strategies even in the absence of direct corrections.
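
A simple sketch of such credit assignment, assuming a session-level reward is spread over the interactions that preceded it, with later interactions weighted more heavily via a discount factor (the function name and weighting scheme are illustrative):

```python
def assign_credit(interaction_ids, session_reward, discount=0.9):
    """Long-term credit assignment sketch: spread a session-level KPI
    (e.g. resolution or engagement) back over the interactions that led
    to it. Later interactions get weights closer to 1, earlier ones are
    discounted, and the credits sum to the session reward."""
    n = len(interaction_ids)
    weights = [discount ** (n - 1 - t) for t in range(n)]
    total = sum(weights)
    return {iid: session_reward * w / total
            for iid, w in zip(interaction_ids, weights)}
```

The resulting per-interaction credits can then stand in for explicit rewards when tuning the retriever, exactly as direct corrections would.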

By embedding feedback directly into retrieval, this system closes the loop between interaction and improvement. The LLM continuously refines its responses, optimizes retrieval, and evolves over time—without manual updates or retraining. Instead of relying on static datasets or human intervention, the system learns from its own interactions, identifying high-utility contexts and dynamically adapting retrieval strategies to maximize performance.

Diagram illustrating the closed-loop learning system

Conclusion

Traditional supervised fine-tuning improves LLM performance by modifying model parameters, but it is costly, introduces long delays, and primarily learns from positive demonstrations rather than a full spectrum of feedback. In contrast, this retrieval-driven approach enables real-time learning from both successes and failures, integrating corrective instructions, few-shot examples, and long-term feedback signals to continuously refine responses.

By embedding retrieval as an optimization layer, the system adapts dynamically, selecting the most effective context for each query and adjusting to new information without retraining. Corrective instructions address mistakes, while high-utility examples ensure adaptability, allowing the model to improve its reasoning over time.

This closes the loop between interaction and improvement, creating a system that learns autonomously from real-world usage. By shifting from static retrieval to continuous optimization, LLMs become more accurate, reliable, and responsive—bridging the gap between pre-trained knowledge and real-time adaptation.

Frequently Asked Questions (FAQ)

How does this approach achieve dynamic optimization without retraining the model?

By generalizing RAG into a dynamic optimization engine, the system retrieves and composes context (documents, instructions, examples) in real time, continuously improving LLM output quality with no model retraining.

How does the system correct erroneous results and reinforce correct outputs?

By forming a closed optimization loop: the system analyzes negative outcomes to adjust its retrieval strategy and reinforces positive outcomes to refine context composition, enabling self-correction and continuous improvement.

Beyond documents, what other kinds of context can RAG leverage?

RAG can retrieve tailored instructions, few-shot examples, and other context types; through prompt tuning and in-context learning, these strengthen the LLM's responses on specialized tasks and complex guidelines.
