How Does Retrieval-Augmented Generation (RAG) Let Large Language Models Consult Documents in Real Time?

2026/4/21

AI Summary (BLUF)

Retrieval-Augmented Generation (RAG) is an AI framework that enhances Large Language Models (LLMs) by integrating real-time information retrieval from external knowledge bases, addressing limitations such as outdated training data, hallucination, and the cost of retraining.

If you've ever wondered how ChatGPT-style apps can suddenly "know" about your company's internal documents, product manuals, or legal files without being retrained, the answer is almost always RAG — Retrieval-Augmented Generation. In this post, we'll break down what RAG is, why it exists, and walk through the full pipeline step-by-step with a real example.

1. What Is RAG?

Retrieval-Augmented Generation (RAG) is an AI framework that integrates an information retrieval component into the generation process of Large Language Models (LLMs) to improve factuality and relevance.

In plain English:

Instead of making the LLM remember everything, we let it look things up in a knowledge base right before answering.

The term RAG was coined in a 2020 research paper by Patrick Lewis et al. ("Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks") published on arXiv. The core insight: combine a parametric memory (the LLM's weights) with a non-parametric memory (a searchable document store) — and you get the best of both worlds.

2. Why RAG? The Motivation

Three big problems drove the invention of RAG:

Limitations of LLMs

LLMs are frozen snapshots. Once a model is trained, it only knows what was in its training data. It doesn't know:

  • What your company policies say
  • What happened after its training cutoff
  • What's in your private documents
  • What yesterday's sales numbers were

And even with what it does know, it can hallucinate confidently.

The Cost of Retraining vs. Dynamic Retrieval

You could retrain or fine-tune the model every time your data changes. But:

  • Retraining a large model can cost tens of thousands to millions of dollars
  • It takes days or weeks
  • You have to do it again every time the data updates

Dynamic retrieval (looking things up at query time) is vastly cheaper and always up-to-date.

The Need for Reliable, Up-to-Date Knowledge

For regulated industries (finance, healthcare, legal), you can't ship answers that come from "the model's memory." You need answers backed by sources you can cite and audit.

RAG addresses all three challenges by decoupling knowledge from the model.

3. The Full RAG Pipeline: A Step-by-Step Walkthrough with a Real Example

This is the part most tutorials rush through. We're going to slow down.

Let's use a concrete example. Imagine you're building an internal developer assistant at a company called Acme Corp. Employees can ask it questions about the engineering handbook, API docs, and on-call runbooks.

A developer asks:

"How do I rotate the database credentials for the billing service?"

Here's exactly what happens behind the scenes.

Stage 1: Indexing (done once, ahead of time)

Before anyone can ask anything, we need to prepare the knowledge base.

Step 1a: The Knowledge Corpus

First, we gather every document we want the assistant to know about:

  • The engineering handbook (Markdown files)
  • API documentation (HTML + Swagger specs)
  • Runbooks (Confluence pages)
  • Past incident post-mortems (Google Docs)
  • Security policies (PDFs)

Let's say this gives us 8,000 documents.

Step 1b: Document Chunking

An LLM can't efficiently search through a 50-page PDF. And you don't want to return a whole 50-page PDF to the user either — you want the one paragraph that actually answers their question.

So we chunk each document into smaller pieces. A common approach:

  • 500 tokens per chunk (~300 words)
  • 50 token overlap between chunks (so we don't split an idea across a boundary)

One chunk in our knowledge base might look like this:

[Chunk #4729 — Source: runbooks/billing-service.md]
"To rotate database credentials for the billing service:
1. Generate a new password in AWS Secrets Manager.
2. Update the 'billing-db' secret with the new value.
3. Trigger a rolling restart via: kubectl rollout restart deploy/billing.
4. Verify health endpoints return 200 OK.
5. Revoke the old credentials after 24h grace period."

After chunking, our 8,000 documents become maybe 120,000 chunks.
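The chunking step can be sketched in a few lines. This is a simplified illustration, not a production chunker: whitespace-separated words stand in for tokens, and the 1,200-word document is synthetic.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks. Words approximate tokens;
    a real pipeline would use the embedding model's tokenizer."""
    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window reached the end of the document
    return chunks

# A synthetic 1,200-word "document":
doc = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks))  # 3  (windows start at words 0, 450, and 900)
```

The 50-token overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is still retrievable as a whole from at least one chunk.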

Step 1c: Vector Embeddings

For each chunk, we call an embedding model (like BERT, OpenAI's text-embedding-3-small, or Cohere's embedder). This turns each chunk into a vector — a list of ~1,536 numbers that represents the meaning of that chunk.

Chunk #4729 → [0.12, -0.08, 0.44, ..., 0.91]   (1,536 numbers)
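To make the "text in, fixed-length vector out" shape concrete, here is a deterministic toy embedder. It is emphatically not a real embedding model (it hashes character trigrams rather than capturing meaning); it only illustrates the interface and the L2 normalization that makes cosine similarity cheap.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Stand-in for a real embedding model (which would return
    ~1,536 dimensions). Hashes character trigrams into `dim`
    buckets, then L2-normalizes the result."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length: cosine == dot product

v = toy_embed("rotate database credentials")
print(len(v))  # 8
```

In a real pipeline you would replace `toy_embed` with a call to your chosen embedding API or library, but the downstream code (storage, similarity search) stays the same.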

Step 1d: The Vector Database

We store all 120,000 of these vectors in a vector database — something like FAISS, Pinecone, Weaviate, Milvus, or Qdrant. The database indexes them so we can search across all of them in milliseconds.

Indexing is done. This usually runs as a background job, and you only re-run it when documents change.

Stage 2: Retrieval (at query time)

Now a developer types:

"How do I rotate the database credentials for the billing service?"

Step 2a: The User Query

The question comes in as plain text.

Step 2b: Query Embedding

We run the same embedding model on the question, producing a query vector:

Query → [0.15, -0.11, 0.48, ..., 0.87]

This is critical: you must embed the query with the same model you used to embed the chunks, otherwise the vectors live in different spaces and similarity becomes meaningless.

Step 2c: Similarity Search

Now we ask the vector database: "Which chunks have vectors closest to this query vector?"

Closeness is measured with a similarity metric, most commonly cosine similarity — it measures the angle between two vectors. The smaller the angle, the more similar the meaning.

Under the hood, the database uses Approximate Nearest Neighbors (ANN) tricks to search 120,000 vectors in ~5 milliseconds instead of comparing one by one.
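Conceptually, what the ANN index approximates is the exact brute-force scan below. The chunk IDs and vectors here are made up for illustration; a real index would hold 120,000 higher-dimensional vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity: the dot product of the two vectors,
    divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """Exact nearest-neighbor search: score every chunk, keep the
    best k. ANN structures (HNSW, IVF) approximate this without
    touching every vector."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Tiny made-up index: chunk id -> embedding vector
index = {
    "chunk_4729": [0.12, -0.08, 0.44, 0.91],   # billing runbook
    "chunk_3180": [0.55, 0.10, 0.20, 0.30],    # Secrets Manager guide
    "chunk_5512": [-0.40, 0.90, -0.10, 0.05],  # unrelated content
}
query = [0.15, -0.11, 0.48, 0.87]
results = top_k(query, index, k=2)
print(results[0][1])  # chunk_4729
```

The query vector sits almost on top of `chunk_4729` (similarity ≈ 0.998), which is exactly the "closest chunks win" behavior the retrieval step relies on.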

Step 2d: Relevant Passages

The database returns the top-k most similar chunks (typically k=3 to k=10). For our query, we might get:

1. Chunk #4729 (score 0.94) — billing-service runbook, credential rotation
2. Chunk #3180 (score 0.89) — AWS Secrets Manager general guide
3. Chunk #5512 (score 0.85) — rolling restart playbook

These are the passages most likely to contain the answer.

Stage 3: Augmentation

Now we have relevant chunks, but we don't just show them to the user. We want the LLM to write a nice, synthesized answer using them.

Step 3a: The Raw Prompt

The user's raw question:

"How do I rotate the database credentials for the billing service?"

Step 3b: The Augmented Prompt

We wrap it in a prompt template that injects the retrieved chunks as context:

You are Acme Corp's internal engineering assistant.
Answer the user's question using ONLY the context below.
If the answer isn't in the context, say you don't know.

---CONTEXT---
[Chunk #4729]: To rotate database credentials for the billing
service: 1. Generate a new password in AWS Secrets Manager...
[Chunk #3180]: AWS Secrets Manager allows you to store and
rotate database credentials...
[Chunk #5512]: Rolling restarts can be triggered via kubectl
rollout restart deploy/<service-name>...
---END CONTEXT---

USER QUESTION: How do I rotate the database credentials
for the billing service?

This is the "A" in RAG — Augmentation. We're augmenting the prompt with retrieved knowledge.
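Building the augmented prompt is plain string templating. A minimal sketch, with the template adapted from the example above (`retrieved` is assumed to be the list of `(chunk_id, text)` pairs the retriever returned):

```python
TEMPLATE = """You are Acme Corp's internal engineering assistant.
Answer the user's question using ONLY the context below.
If the answer isn't in the context, say you don't know.

---CONTEXT---
{context}
---END CONTEXT---

USER QUESTION: {question}"""

def build_prompt(question, retrieved):
    """Inject retrieved chunks into the prompt template.
    retrieved: list of (chunk_id, text) pairs, best match first."""
    context = "\n".join(f"[{cid}]: {text}" for cid, text in retrieved)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "How do I rotate the database credentials for the billing service?",
    [("Chunk #4729", "To rotate database credentials for the billing "
                     "service: 1. Generate a new password in AWS Secrets Manager...")],
)
print(prompt.splitlines()[0])  # You are Acme Corp's internal engineering assistant.
```

Keeping the chunk IDs in the context is a cheap way to let the model cite its sources in the final answer.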

Stage 4: Generation

Step 4a: Calling the LLM

We send the augmented prompt to an LLM (GPT-3/4/5, Claude, LLaMA, T5, Mistral — your choice).

Because the model has the actual documentation right there in its context window, it doesn't need to remember anything — it just reads and synthesizes.

Step 4b: The Final Response

The model produces a grounded answer:

"To rotate the database credentials for the billing service:

  1. Generate a new password in AWS Secrets Manager.
  2. Update the billing-db secret with the new value.
  3. Trigger a rolling restart with kubectl rollout restart deploy/billing.
  4. Verify that the health endpoints return 200 OK.
  5. Revoke the old credentials after a 24-hour grace period.

Source: runbooks/billing-service.md"

Notice three things:

  1. The answer is accurate — it comes from your actual docs.
  2. The answer is current — if you update the runbook, the next query uses the new version. No retraining needed.
  3. The answer can be cited — you know exactly which document it came from.

That's the whole RAG pipeline. Indexing → Retrieval → Augmentation → Generation.
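The query-time half of the pipeline can be sketched as a single function. The `embed`, `search`, and `llm` callables below are placeholders for whatever implementations you plug in (an embedding API, a vector database client, a chat model); the stand-ins used in the example are hypothetical.

```python
def rag_answer(question, embed, search, llm, k=5):
    """Retrieval -> Augmentation -> Generation at query time.
    Indexing is assumed to have already run offline."""
    qvec = embed(question)                   # Step 2b: query embedding
    hits = search(qvec, k)                   # Steps 2c-2d: [(chunk_id, text), ...]
    context = "\n".join(f"[{cid}]: {text}" for cid, text in hits)  # Step 3b
    prompt = (
        "Answer using ONLY the context below.\n"
        f"---CONTEXT---\n{context}\n---END CONTEXT---\n"
        f"QUESTION: {question}"
    )
    return llm(prompt)                       # Step 4a: generation

# Wiring it up with stand-in components:
answer = rag_answer(
    "How do I rotate the billing DB credentials?",
    embed=lambda q: [0.1, 0.2],
    search=lambda v, k: [("Chunk #4729",
                          "Generate a new password in AWS Secrets Manager...")],
    llm=lambda p: "1. Generate a new password in AWS Secrets Manager. ...",
)
print(answer.startswith("1."))  # True
```

Because the stages are decoupled behind these three interfaces, you can swap the embedding model, the vector store, or the LLM independently without touching the rest of the pipeline.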

4. The Retrieval Components in Detail

Three pieces make retrieval work:

The Embedding Model

The model that turns text into vectors. Examples: BERT, text-embedding-3-small, Cohere Embed, Sentence-BERT. Choose one that's trained well for your language and domain.

The Vector Database

Databases optimized for vector similarity search. Popular options: FAISS (local, Facebook), Pinecone (managed), Weaviate, Milvus, Qdrant, and pgvector (Postgres extension).

The Similarity Metric

How we measure "closeness" between vectors. The go-to is cosine similarity, but Euclidean distance and dot product also show up. Cosine similarity is popular because it ignores vector length and focuses on direction — which is what semantic meaning lives in.
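A quick sketch of all three metrics, showing why cosine similarity ignores magnitude: the two vectors below point in the same direction but have different lengths, so cosine calls them identical while Euclidean distance calls them far apart.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Dot product normalized by both vector lengths: depends only on angle.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Same direction, different lengths:
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(round(cosine(a, b), 6))     # 1.0    -> "identical meaning"
print(round(euclidean(a, b), 4))  # 3.7417 -> "far apart"
```

Note that when vectors are pre-normalized to unit length (as many embedding pipelines do), dot product and cosine similarity give identical rankings, and dot product is cheaper to compute.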

5. Augmentation and Generation in Detail

The Prompt Template

The structure that tells the LLM how to use the retrieved context. Good templates specify:

  • The assistant's role
  • What to do if context is missing
  • Output format (JSON, bullet points, prose)
  • Citation rules

Managing the Model's Context

The LLM only has so much context window. If retrieval returns 30 chunks and each chunk is 500 tokens, that's 15,000 tokens just for context. You have to limit top-k, rerank and keep only the best chunks, or truncate the context so it fits within the model's token budget.
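The simplest of these strategies can be sketched as a greedy budget fill: walk the chunks in relevance order and stop before the budget overflows. The whitespace `count` function is a rough stand-in for the model's real tokenizer.

```python
def fit_to_budget(chunks, max_tokens=4000, count=lambda s: len(s.split())):
    """Keep the highest-ranked chunks that fit within the token budget.
    `chunks` is assumed sorted by relevance, best first; `count` is a
    whitespace approximation of tokenization."""
    kept, used = [], 0
    for chunk in chunks:
        n = count(chunk)
        if used + n > max_tokens:
            break  # adding this chunk would overflow the context window
        kept.append(chunk)
        used += n
    return kept

# Three 300-"token" chunks against a 650-token budget: only two fit.
chunks = ["a " * 300, "b " * 300, "c " * 300]
print(len(fit_to_budget(chunks, max_tokens=650)))  # 2
```

Greedy truncation is the bluntest option; reranking with a cross-encoder or summarizing low-ranked chunks preserves more information at the cost of extra compute.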

FAQ

How does RAG solve the problem of outdated LLM knowledge?

RAG retrieves from external knowledge bases (company documents, fresh data) in real time, dynamically fetching the latest information before generating an answer. No retraining is required, and responses stay grounded in up-to-date sources.

Why is RAG cheaper than retraining the model?

Retraining a large model costs tens of thousands to millions of dollars, takes weeks, and must be repeated on every data update. RAG builds its index once, then retrieves dynamically at query time, so it is far cheaper and always current.

How does RAG keep answers accurate and traceable?

RAG couples generation with retrieval, so answers come directly from verifiable external documents (product manuals, legal files). Each answer carries citable, auditable sources and is less prone to hallucination, which makes RAG well suited to regulated domains like finance, healthcare, and law.
