
RAG如何解决大语言模型的知识截止和幻觉问题?2026年企业应用实战指南

2026/4/23

AI Summary (BLUF)

This comprehensive guide explains Retrieval-Augmented Generation (RAG) as a solution to large language models' knowledge cutoff and hallucination issues, detailing its five-generation evolution, practical technology selection, and a path to enterprise production deployment.

RAG 实战指南:从概念到生产部署的完整路径

这不是一篇给你讲概念的文章。

This is not an article that merely explains concepts to you.

这是一份让你看完就能动手,少走半年弯路的实战指南。

This is a practical guide that you can start implementing immediately after reading, helping you avoid six months of detours.

为什么你必须搞懂 RAG

Why You Must Understand RAG

2023 年是大模型“百模大战”年,所有人都在刷榜单、比参数。2024 年起,战场转移了——谁能把大模型真正用起来,谁才有价值。

2023 was the year of the “Hundred Models Battle,” with everyone focused on leaderboards and parameter counts. Starting in 2024, the battlefield has shifted—those who can truly put large models to use are the ones who create value.

而检索增强生成(RAG,Retrieval-Augmented Generation),就是这场“应用落地战”里最核心的武器。RAG 能让大模型在生产环境真正用起来。

And Retrieval-Augmented Generation (RAG) is the most critical weapon in this “application deployment battle.” RAG enables large models to be truly usable in production environments.

不夸张地说:没有 RAG 打底,一切 AI 应用都是 PPT。

It’s no exaggeration to say: Without RAG as a foundation, all AI applications are just PowerPoint presentations.

你可能在无数地方见过 RAG 这个词,但很多讲解要么只停在“向量检索+大模型生成”这层皮,要么铺天盖地的英文论文让人望而却步。这篇文章的目标只有一个:让你真正搞懂 RAG,并且能落地。

You may have encountered the term RAG in countless places, but many explanations either stop at the surface level of “vector retrieval + LLM generation” or are intimidating with overwhelming English papers. The goal of this article is singular: to help you truly understand RAG and be able to implement it.

文章结构如下:

The article is structured as follows:

  1. RAG 是什么,为什么需要它 (What is RAG, and why do we need it?)

  2. RAG 技术的发展迭代历程 (The evolution and development of RAG technology)

  3. 落地时如何做技术选型 (How to make technical choices during implementation)

  4. 业界当前的经典实践 (Current classic practices in the industry)

  5. RAG 未来的发展方向 (Future directions for RAG)

  6. 从零到一的 RAG 实战落地路径 (A practical path for implementing RAG from zero to one)

全文约 12000 字,干货优先,代码和图表穿插,一次读完,够用一年。

The full article is approximately 12,000 words, prioritizing substance, interspersed with code and charts. Read it once, and it will serve you for a year.

第一章:RAG 是什么,为什么需要它?

Chapter 1: What is RAG, and Why Do We Need It?

1.1 从一个真实的痛点说起

1.1 Starting from a Real Pain Point

你公司买了 GPT-4 API 权限,花了两周做了一个“企业智能客服”——把公司所有产品文档喂进去,用户提问,AI 作答。

Your company purchased GPT-4 API access and spent two weeks building an “enterprise intelligent customer service” system—feeding all company product documentation into it, where users ask questions and the AI answers.

演示很完美。上线第一天,用户来问:

The demo was perfect. On the first day of launch, a user asks:

“你们最新出的 Pro 版本,和去年的 Basic 版本相比,具体差在哪里?”

“What are the specific differences between your latest Pro version and last year’s Basic version?”

AI 答得头头是道。可你看完之后发现——它在瞎说。

The AI answers convincingly. But after reading it, you realize—it’s talking nonsense.

因为 GPT-4 根本不知道你们公司存在,更不知道你们有什么产品。 它给出的答案完全是根据训练数据“编”出来的。

Because GPT-4 has no knowledge of your company’s existence, let alone your products. The answers it provides are entirely “fabricated” based on its training data.

这就是大模型的两大致命缺陷:

These are the two fatal flaws of large models:

① 知识截止(Knowledge Cutoff)

① Knowledge Cutoff

所有大模型都有训练截止日期。GPT-4 的训练数据截止到某个时间点,之后发生的事情它一概不知。你公司上个月发布的新产品,它当然不知道。

All large models have a training cutoff date. GPT-4’s training data ends at a certain point in time; it knows nothing about events after that. It certainly doesn’t know about the new product your company released last month.

② 幻觉(Hallucination)

② Hallucination

幻觉就是大模型生成看似合理但实际是错误的回答,是大模型在 “一本正经地胡说八道”。大模型是在海量数据上训练出来的玩“文字接龙”的概率预测机器,大模型没有思想,只是在做极致的数学计算。当它被问到不知道的事情时,它不会说“我不知道”,而是会“合情合理地编造”一个听起来像真的答案。这个问题在专业领域里会造成严重后果。

Hallucination means a large model generates answers that sound plausible but are factually wrong: the model is “spouting nonsense with a straight face.” A large model is a probability-prediction machine trained on massive data to play a “word chain” game; it has no understanding of its own, only extremely scaled-up mathematical computation. When asked about something it does not know, it will not say “I don't know”; instead it will plausibly fabricate an answer that sounds true. In professional fields, this can have serious consequences.

那能不能把知识喂进去训练?

Can we feed knowledge into it for training?

理论上可以,但:

Theoretically yes, but:

  • 重新微调一个大模型,费用从几万到几百万不等; (Fine-tuning a large model from scratch can cost anywhere from tens of thousands to millions.)

  • 你的文档每天都在更新,不可能每次更新都去重训; (Your documents are updated daily; it’s impossible to retrain every time.)

  • 训练完的知识“固化”在权重里,之后依然存在知识截止问题。 (The trained knowledge becomes “solidified” in the weights, and the knowledge cutoff problem persists afterward.)

RAG 就是来解决这两个问题的。

RAG is designed to solve these two problems.

1.2 RAG 的核心思路

1.2 The Core Idea of RAG

RAG 的核心思路极其简单,用一句话概括:

The core idea of RAG is extremely simple and can be summarized in one sentence:

在让大模型作答之前,先去外部知识库找到相关信息,然后把这些信息连同问题一起交给大模型。

Before letting the large model answer, first go to an external knowledge base to find relevant information, then provide this information along with the question to the large model.

用生活化的比喻来说:

Using a life analogy:

你去参加一场开卷考试,不需要把所有知识背进脑子里——你只需要知道去哪里找,以及如何把找到的内容用在答案上。

When you take an open-book exam, you don’t need to memorize all the knowledge—you just need to know where to find it and how to use what you find in your answer.

RAG 里的大模型就是那个能看懂资料、组织语言作答的“学生”,而外部知识库就是那本“参考书”。

In RAG, the large model is the “student” who can understand the materials and organize language to answer, while the external knowledge base is the “reference book.”

RAG 全称 Retrieval-Augmented Generation,直译是“检索增强生成”,三个词对应三个步骤:

RAG stands for Retrieval-Augmented Generation, literally translated as “retrieval-augmented generation.” The three words correspond to three steps:

用户提问
   │
   ▼
[Retrieval 检索] → 去知识库里找相关文档片段
   │
   ▼
[Augmentation 增强] → 把找到的内容拼到 Prompt 里
   │
   ▼
[Generation 生成] → 大模型根据上下文生成答案
User asks a question
   │
   ▼
[Retrieval] → Find relevant document snippets in the knowledge base
   │
   ▼
[Augmentation] → Incorporate the found content into the Prompt
   │
   ▼
[Generation] → The large model generates an answer based on the context
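
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a production implementation: the knowledge base is hard-coded, `retrieve()` scores by naive character overlap in place of embeddings and a vector database, and `generate()` is a stub standing in for a real LLM API call.

```python
# Toy sketch of the three RAG steps: Retrieval → Augmentation → Generation.

KNOWLEDGE_BASE = [
    "Pro 版本支持 SSO 单点登录,Basic 版本不支持。",
    "Basic 版本最多支持 10 个用户,Pro 版本无用户数限制。",
    "所有版本均提供邮件支持。",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Step 1, Retrieval: rank snippets by character overlap with the query.
    A real system would use an Embedding model + vector search here."""
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(set(query) & set(doc)),
                    reverse=True)
    return ranked[:top_k]

def augment(query: str, snippets: list[str]) -> str:
    """Step 2, Augmentation: splice the retrieved snippets into the prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"根据以下资料回答问题:\n{context}\n\n问题:{query}"

def generate(prompt: str) -> str:
    """Step 3, Generation: placeholder for an LLM API call."""
    return f"[答案,基于 {prompt.count('- ')} 条检索到的资料]"

query = "Pro 和 Basic 版本差在哪里?"
answer = generate(augment(query, retrieve(query)))
```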

1.3 RAG 解决了什么,没解决什么

1.3 What RAG Solves and What It Doesn't

RAG 解决的问题:

Problems RAG Solves:

  • ✅ 知识时效性:外部知识库随时可更新,不需要重训模型 (Knowledge Timeliness: The external knowledge base can be updated at any time without retraining the model.)

  • ✅ 幻觉抑制:答案有“依据”可查,减少无依据编造 (Hallucination Mitigation: Answers have “evidence” to reference, reducing baseless fabrication.)

  • ✅ 私有知识接入:企业内部文档、专有数据可安全接入 (Private Knowledge Integration: Internal corporate documents and proprietary data can be securely integrated.)

  • ✅ 可追溯性:答案可以附上来源链接,用户可自行核实 (Traceability: Answers can include source links, allowing users to verify independently.)

  • ✅ 成本可控:无需重训大模型,只需维护知识库 (Controllable Cost: No need to retrain large models, only maintain the knowledge base.)

RAG 没有解决的问题:

Problems RAG Does Not Solve:

  • ❌ 复杂推理:需要多步逻辑推导的问题,基础 RAG 依然力不从心 (Complex Reasoning: Problems requiring multi-step logical deduction are still challenging for basic RAG.)

  • ❌ 极致实时性:入库、索引构建存在一定延迟 (Extreme Real-time Requirements: There is inherent delay in ingestion and index building.)

  • ❌ 跨文档关联推理:“A 和 B 两个文档里的信息联合说明了什么”这类问题,基础 RAG 效果较差 (Cross-document Associative Reasoning: Questions like “What do the combined pieces of information from documents A and B imply?” are less effective with basic RAG.)

这些问题是 Advanced RAG 和 Agentic RAG 要解决的,我们后面会讲。

These issues are addressed by Advanced RAG and Agentic RAG, which we will discuss later.


第二章:RAG 技术的发展迭代

Chapter 2: The Evolution of RAG Technology

RAG 技术从提出到今天,经历了清晰可辨的五代演进。

From its proposal to today, RAG technology has undergone five distinct generations of evolution.

2.1 第一代:概念诞生(2020 年)

2.1 First Generation: Conceptual Birth (2020)

RAG 这个词最早由 Facebook AI Research 在 2020 年的论文《Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks》里明确提出。

The term RAG was first explicitly proposed by Facebook AI Research in their 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”.

这篇论文里的 RAG 和我们今天用的有本质区别:它是端到端可训练的。检索器和生成器是一个整体,用联合训练的方式来优化。

The RAG in this paper is fundamentally different from what we use today: it was end-to-end trainable. The retriever and generator were a single unit, optimized through joint training.

当时这个架构的问题很明显:

The problems with this architecture at the time were evident:

  • 训练成本高,工程难度大 (High training cost, significant engineering difficulty.)

  • 需要有标注数据才能训练检索器 (Required labeled data to train the retriever.)

  • 无法直接使用“现成大模型”,必须联合训练 (Could not directly use “off-the-shelf large models”; joint training was mandatory.)

所以这一代 RAG 主要停留在学术圈,没有大规模落地。

Therefore, this generation of RAG remained primarily in academia and was not widely deployed.

2.2 第二代:范式确立(2022–2023 年)

2.2 Second Generation: Paradigm Establishment (2022–2023)

ChatGPT 的爆火是一个分水岭。大量企业迫切需要把大模型用起来,但又面临“幻觉”和“知识时效”两大问题。

The explosive popularity of ChatGPT was a watershed moment. A large number of enterprises urgently needed to utilize large models but faced the twin problems of “hallucination” and “knowledge timeliness.”

这时候,一种更务实的 RAG 范式出现了:

At this point, a more pragmatic RAG paradigm emerged:

不做联合训练,直接用 Prompt Engineering 把检索结果塞进上下文。

No joint training; directly use Prompt Engineering to insert retrieval results into the context.

这一代 RAG 的架构变成了松散耦合的两个独立组件:

The architecture of this generation of RAG became two loosely coupled, independent components:

  • 检索器:负责找相关内容,通常是向量数据库 + Embedding 模型 (Retriever: Responsible for finding relevant content, typically a vector database + Embedding model.)

  • 生成器:任意大模型(GPT-4、Claude 等),通过 Prompt 输入检索结果 (Generator: Any large model (GPT-4, Claude, etc.), receiving retrieval results via the Prompt.)

这个范式彻底降低了门槛。LangChain、LlamaIndex 等框架的出现,让“5 分钟搭一个 RAG demo”成为可能。

This paradigm drastically lowered the barrier to entry. The emergence of frameworks like LangChain and LlamaIndex made “building a RAG demo in 5 minutes” possible.

2023 年是 RAG 的“野蛮生长年”,每家公司都在搭自己的知识库问答,大量 RAG 应用上线。

2023 was the “year of wild growth” for RAG. Every company was building its own knowledge base Q&A, and a large number of RAG applications went live.

但很快大家发现:Demo 效果好,生产效果差。 这催生了对 RAG 的深度优化需求。

But soon, everyone realized: Demo performance was good, but production performance was poor. This spurred the demand for deep optimization of RAG.

2.3 第三代:Advanced RAG(2023–2024 年)

2.3 Third Generation: Advanced RAG (2023–2024)

研究者和工程师开始分析 RAG 失效的原因,总结出核心问题出在三个环节:

Researchers and engineers began analyzing the reasons for RAG failures, concluding that the core problems lay in three stages:

① 检索前(Pre-Retrieval)问题

① Pre-Retrieval Problems

  • 用户提问本身质量差,导致检索出错 (Poor quality user queries leading to retrieval errors.)

  • 歧义表达、口语化表达导致语义匹配失败 (Ambiguous or colloquial expressions causing semantic matching failures.)

② 检索中(During Retrieval)问题

② During-Retrieval Problems

  • 文本切分(Chunking)策略不当,把关键信息切断 (Inappropriate text chunking strategies, cutting off key information.)

  • 纯向量检索对精确匹配词(人名、代号、型号)效果差 (Pure vector retrieval performs poorly on exact match terms (names, codes, model numbers).)

③ 检索后(Post-Retrieval)问题

③ Post-Retrieval Problems

  • 召回内容过多,把重要信息“淹没” (Too much retrieved content, “drowning out” important information.)

  • 没有对召回结果做质量过滤 (No quality filtering of retrieved results.)

Advanced RAG 针对这三个环节提出了对应优化:

Advanced RAG proposed corresponding optimizations for these three stages:

Pre-Retrieval 优化:

Pre-Retrieval Optimizations:

  • Query Rewriting(查询改写):用大模型把模糊问题改写成检索友好格式 (Query Rewriting: Use a large model to rewrite vague questions into a retrieval-friendly format.)

  • Query Expansion(查询扩展):一个问题扩展成多个角度子问题,提升召回率 (Query Expansion: Expand one question into multiple sub-questions from different angles to improve recall.)

  • HyDE(假设文档嵌入):先让大模型“假设”一个答案,用假设答案去检索 (HyDE (Hypothetical Document Embeddings): First, let the large model “hypothesize” an answer, then use the hypothetical answer for retrieval.)
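
As a concrete illustration of Query Expansion, the sketch below fans one question out into several sub-queries and unions the hits. `expand()` is a hand-written stand-in for the LLM rewrite step, and the `corpus` dict is a hypothetical search backend, both invented for this example.

```python
# Query Expansion sketch: one question → several retrieval queries → merged hits.

def expand(query: str) -> list[str]:
    """Stand-in for an LLM generating angle-specific sub-queries."""
    return [query, f"{query} 的定义", f"{query} 的常见问题"]

def multi_query_retrieve(query: str, search_fn) -> list[str]:
    hits: list[str] = []
    for sub_query in expand(query):
        for doc in search_fn(sub_query):
            if doc not in hits:  # de-duplicate, keep first-seen order
                hits.append(doc)
    return hits

# Hypothetical search backend mapping queries to document IDs.
corpus = {
    "Pro 版": ["doc1"],
    "Pro 版 的定义": ["doc2"],
    "Pro 版 的常见问题": ["doc1", "doc3"],
}
hits = multi_query_retrieve("Pro 版", lambda q: corpus.get(q, []))
```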

During Retrieval 优化:

During-Retrieval Optimizations:

  • 混合检索(Hybrid Search):向量检索(语义)+ BM25(关键词)并行 (Hybrid Search: Vector retrieval (semantic) + BM25 (keyword) in parallel.)

  • Chunk 策略优化:小块检索、大块喂给 LLM (Chunking Strategy Optimization: Small chunks for retrieval, large chunks fed to the LLM.)

  • 父文档检索(Parent Document Retrieval):细粒度定位,粗粒度返回上下文 (Parent Document Retrieval: Fine-grained for location, coarse-grained for returning context.)
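
Hybrid search needs a way to merge the two ranked lists; Reciprocal Rank Fusion (RRF) is one common choice. The sketch below fuses two hypothetical doc-ID rankings; `k = 60` is the constant commonly used with RRF.

```python
# Reciprocal Rank Fusion (RRF): merge several rankings by summing 1/(k + rank)
# per document. The doc IDs are illustrative placeholders.

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic (vector) ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # keyword (BM25) ranking
fused = rrf_fuse([vector_hits, bm25_hits])
```

A document ranked well by both retrievers (here `doc_a`) rises to the top even if neither retriever ranked it first.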

Post-Retrieval 优化:

Post-Retrieval Optimizations:

  • Re-ranking(重排序):用 Cross-Encoder 精细打分,提升 Top-K 质量 (Re-ranking: Use Cross-Encoder for fine-grained scoring to improve Top-K quality.)

  • 上下文压缩(Context Compression):剔除无关冗余,减轻 LLM 上下文压力 (Context Compression: Remove irrelevant redundancy to reduce LLM context pressure.)
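
The re-ranking step can be sketched as below. `pair_score()` is a toy stand-in for a Cross-Encoder, which scores the query and document jointly rather than embedding them separately; everything in this snippet is illustrative.

```python
# Re-ranking sketch: a second-stage scorer re-orders first-stage candidates.

def pair_score(query: str, doc: str) -> float:
    """Toy stand-in for a Cross-Encoder: fraction of query terms found in doc."""
    terms = query.lower().split()
    return sum(term in doc.lower() for term in terms) / len(terms)

def rerank(query: str, candidates: list[str], top_k: int = 2) -> list[str]:
    return sorted(candidates, key=lambda d: pair_score(query, d),
                  reverse=True)[:top_k]

candidates = [
    "pricing page for the Basic plan",
    "Pro plan feature list and pricing",
    "company history and mission",
]
best = rerank("pro plan pricing", candidates)
```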

这一阶段 RAG 效果显著提升,但系统复杂度也大幅增加。

In this phase, RAG effectiveness improved significantly, but system complexity also increased substantially.

2.4 第四代:Modular RAG(2024 年)

2.4 Fourth Generation: Modular RAG (2024)

随着 Advanced RAG 组件越来越多,研究者开始思考一个更高层次的问题:

As Advanced RAG components proliferated, researchers began considering a higher-level question:

不同查询场景,需要的 RAG 流程不同。能不能让 RAG 流程动态可配置?

Different query scenarios require different RAG workflows. Can the RAG workflow be made dynamically configurable?

Modular RAG 的思路是:把每个 RAG 环节抽象成独立模块,根据查询类型、数据源动态组合。

The idea of Modular RAG is: abstract each RAG stage into an independent module and dynamically combine them based on query type and data source.

核心组件拆分:

Core component breakdown:

  • Search Module:向量、关键词、知识图谱、SQL 查询 (Search Module: Vector, keyword, knowledge graph, SQL query.)

  • Memory Module:短期上下文记忆、长期知识存储 (Memory Module: Short-term context memory, long-term knowledge storage.)

  • Fusion Module:多路召回结果融合 (Fusion Module: Fusion of multi-path retrieval results.)

  • Routing Module:根据查询类型路由到不同检索策略 (Routing Module: Routes to different retrieval strategies based on query type.)

  • Predict Module:子问题拆分与迭代检索 (Predict Module: Sub-question splitting and iterative retrieval.)
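
A Routing Module can be sketched as a function that picks a retrieval strategy from the query's shape. The keyword rules below are illustrative only; production routers often use an LLM or a trained classifier instead.

```python
# Routing Module sketch: map a query to one of several retrieval strategies.

def route(query: str) -> str:
    if any(kw in query for kw in ("多少", "平均", "总共")):
        return "sql"              # aggregation question → structured (SQL) query
    if any(kw in query for kw in ("关系", "关联", "之间")):
        return "knowledge_graph"  # relational question → knowledge-graph retrieval
    return "vector"               # default → semantic vector search

strategy = route("上季度总共卖了多少台?")
```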

这个架构更灵活,更像一个“平台”而不是一条“流水线”。

This architecture is more flexible, resembling a “platform” rather than a single “pipeline.”

2.5 第五代:Agentic RAG(2025 年起)

2.5 Fifth Generation: Agentic RAG (Starting 2025)

更进一步的演化:把 RAG 流程里的控制权交给大模型自己决策。

A further evolution: Hand over control within the RAG workflow to the large model for its own decision-making.

Agentic RAG 就是让智能体(Agent)自主思考、规划、调用工具,代替固定流程去完成检索、推理、纠错,最终更聪明地回答复杂问题的 RAG。

Agentic RAG is RAG that allows an intelligent agent to autonomously think, plan, and call tools, replacing fixed workflows to perform retrieval, reasoning, error correction, and ultimately answer complex questions more intelligently.

传统 RAG 是固定单次检索流程:检索一次 → 生成。

Traditional RAG is a fixed single-retrieval process: retrieve once → generate.

Agentic RAG 让大模型能够:

Agentic RAG enables the large model to:

  • 判断当前召回内容是否足够 (Judge whether the current retrieved content is sufficient.)

  • 决定是否需要多轮检索(多跳检索) (Decide whether multiple rounds of retrieval (multi-hop retrieval) are needed.)

  • 选择从哪个数据源检索 (Choose which data source to retrieve from.)

  • 评估生成答案是否可靠 (Evaluate whether the generated answer is reliable.)

本质是 Agent 推理能力 + RAG 知识检索能力 的结合。这是 RAG 当前最前沿方向,第五章详细展开。

The essence is the combination of Agent reasoning capability + RAG knowledge retrieval capability. This is the current cutting-edge direction for RAG, detailed in Chapter 5.
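
The agentic loop can be sketched as below: retrieve, judge sufficiency, and decide whether another hop is needed. `retrieve()` and `judge()` are toy stand-ins invented for this example; in a real system both decisions would be driven by the LLM.

```python
# Agentic RAG loop sketch: keep retrieving until the evidence is judged
# sufficient, up to a hop limit.

def retrieve(query: str, hop: int) -> list[str]:
    """Stand-in retriever: the second hop surfaces the missing evidence."""
    corpus = {1: ["部分证据"], 2: ["部分证据", "关键证据"]}
    return corpus.get(hop, [])

def judge(evidence: list[str]) -> bool:
    """Stand-in for an LLM judging whether evidence answers the question."""
    return "关键证据" in evidence

def agentic_answer(query: str, max_hops: int = 3) -> tuple[list[str], int]:
    evidence: list[str] = []
    hops_used = 0
    for hop in range(1, max_hops + 1):
        evidence = retrieve(query, hop)
        hops_used = hop
        if judge(evidence):  # enough evidence: stop retrieving
            break
    return evidence, hops_used

evidence, hops = agentic_answer("复杂的多跳问题")
```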


第三章:落地 RAG 时的技术选型

Chapter 3: Technical Selection for Implementing RAG

很多人做 RAG 技术选型时犯了同一个错误:把选型当成收集“最强组件”的游戏,最终搭出臃肿系统,效果不升反降。

Many people make the same mistake during RAG technical selection: treating selection as a game of collecting the “strongest components,” ultimately building a bloated system where effectiveness decreases instead of improving.

技术选型核心原则:匹配场景,简单优先。

Core Principle of Technical Selection: Match the scenario, prioritize simplicity.

下面逐层拆解每个环节选型要点。

Below, we break down the key points for selection at each stage.

3.1 文档解析层

3.1 Document Parsing Layer

为什么重要: 数据工程是 RAG 效果的天花板。内容解析得差,后面怎么优化都是填坑。

Why it’s important: Data engineering is the ceiling for RAG effectiveness. Poor content parsing means any subsequent optimization is just filling holes.

主要挑战:

Main Challenges:

  • PDF 里的表格、多栏布局、图片处理 (Tables, multi-column layouts, image processing in PDFs.)

  • 扫描版 PDF 需要 OCR (Scanned PDFs require OCR.)

  • Word、PPT、网页等多格式统一处理 (Unified processing of multiple formats like Word, PPT, web pages.)

工具选型:

Tool Selection:

工具 (Tool) | 核心特点 (Core Features) | 适用场景 (Applicable Scenarios)
PyMuPDF | 轻量快速,纯文本提取准确 (Lightweight, fast, accurate pure text extraction.) | 文字版 PDF,快速上手 (Text-based PDFs, quick start.)
Docling | 支持 GPU,表格/图表识别强 (GPU support, strong table/chart recognition.) | 复杂排版,生产环境 (Complex layouts, production environment.)
Unstructured | 格式支持最广(20+ 种) (Widest format support (20+ types).) | 多格式混合文档库 (Mixed-format document libraries.)
LlamaParse | 云服务,专为 RAG 优化 (Cloud service, optimized for RAG.) | 不想自建解析基础设施 (Don't want to build parsing infrastructure.)
MinerU | 中文支持好,开源免费 (Good Chinese support, open-source and free.) | 中文文档为主的场景 (Scenarios dominated by Chinese documents.)
pdfplumber | 轻量、精准,表格提取极强,可定位坐标,纯 Python (Lightweight, precise, excellent table extraction, coordinate positioning, pure Python.) | 文字版 PDF、精准表格抽取、无需复杂排版 (Text-based PDFs, precise table extraction, no complex layout needed.)

实践建议:

Practical Advice:

  • 先用最简单工具跑通,再根据问题针对性升级 (Start with the simplest tool to get it working, then upgrade based on specific problems.)

  • 表格是解析难点:大表格拆成“属性-值对”单独存储效果更好 (Tables are a parsing challenge: splitting large tables into “attribute-value pairs” for separate storage works better.)

  • 自建正则清理逻辑,去掉页眉页脚、目录页码等噪声 (Build custom regex cleaning logic to remove noise like headers, footers, table of contents, page numbers.)
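
The regex-cleaning advice can be sketched as below. The header and page-number patterns are examples invented for illustration; tune them to the noise your own documents actually contain.

```python
import re

# Illustrative cleanup of parsed PDF text: strip a repeated header line and
# bare page-number lines, then collapse the blank lines left behind.

HEADER = re.compile(r"^公司内部资料.*$", re.MULTILINE)   # example header pattern
PAGE_NO = re.compile(r"^\s*第?\s*\d+\s*页?\s*$", re.MULTILINE)  # "3" / "第 3 页"

def clean(text: str) -> str:
    text = HEADER.sub("", text)
    text = PAGE_NO.sub("", text)
    return re.sub(r"\n{2,}", "\n", text).strip()

raw = "公司内部资料 v2\n产品支持 SSO。\n3\n详情见下一章。\n"
cleaned = clean(raw)
```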

3.2 文本切分层(Chunking)

3.2 Text Chunking Layer (Chunking)

这是被低估最严重的环节。 切分策略直接决定检索质量,没有“万能大小”,只有“适合场景的策略”。

This is the most severely underestimated stage. The chunking strategy directly determines retrieval quality. There is no “one-size-fits-all” size, only “strategies suitable for the scenario.”

常见策略对比:

Comparison of Common Strategies:

① 固定大小切分(Fixed-size Chunking)

① Fixed-size Chunking

  • 按 Token/字符截断,可设重叠窗口 (Truncate by Token/character, can set an overlap window.)

  • 优点:简单,索引高效 (Advantages: Simple, efficient indexing.)

  • 缺点:容易在关键信息处截断 (Disadvantages: Prone to cutting off key information.)

  • 参考:300–512 Token,50–100 Token 重叠 (Reference: 300–512 Tokens, 50–100 Token overlap.)
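
The fixed-size strategy with an overlap window can be sketched as below. For simplicity this slices by characters; a real implementation would count tokens with a tokenizer instead.

```python
# Fixed-size chunking with overlap: advance by (size - overlap) each step,
# so adjacent chunks share `overlap` characters of context.

def chunk_fixed(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(700))
chunks = chunk_fixed(text)
```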

② 语义切分(Semantic Chunking)

② Semantic Chunking

  • 基于句子嵌入相似度,在语义断点切割 (Cuts at semantic breakpoints based on sentence embedding similarity.)

  • 优点:保持语义完整性 (Advantages: Maintains semantic integrity.)

  • 缺点:计算成本高,结果不均 (Disadvantages: High computational cost, uneven results.)

③ 递归结构切分(Recursive Split)

③ Recursive Split

  • 先按段落、再按句子、再按字符递归切分 (Recursively splits first by paragraph, then by sentence, then by character.)

  • LangChain RecursiveCharacterTextSplitter 代表 (Represented by LangChain's RecursiveCharacterTextSplitter.)

  • 适合大多数通用场景 (Suitable for most general scenarios.)

④ 文档感知切分(Document-aware Chunking)

④ Document-aware Chunking

  • Markdown 按标题层级切分 (Markdown split by heading hierarchy.)

  • 代码按函数/类切分 (Code split by function/class.)

  • 根据文档结构而非纯文本切分 (Chunking based on document structure rather than pure text.)

⑤ 父子 Chunk(Parent-Child Chunking)

⑤ Parent-Child Chunking

  • 小 Chunk(128 Token)用于精确检索 (Small Chunks (128 Tokens) for precise retrieval.)

  • 大 Chunk(512–1024 Token)用于给 LLM 提供上下文 (Large Chunks (512–1024 Tokens) for providing context to the LLM.)

  • 检索用小 Chunk 定位,返回对应大 Chunk (Use small chunks for retrieval location, return the corresponding large chunk.)
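
Parent-child retrieval can be sketched as below: index small child chunks, match against them, but return the whole parent chunk as LLM context. The sizes here are tiny toy values (the article suggests roughly 128-token children and 512–1024-token parents), and the overlap scoring is a stand-in for vector similarity.

```python
# Parent-child chunking sketch: precise matching on children, context from parents.

def build_index(parents: list[str], child_size: int = 128) -> list[tuple[str, int]]:
    """Return (child_text, parent_index) pairs for indexing."""
    index = []
    for p_idx, parent in enumerate(parents):
        for i in range(0, len(parent), child_size):
            index.append((parent[i:i + child_size], p_idx))
    return index

def retrieve_parent(query: str, index: list[tuple[str, int]],
                    parents: list[str]) -> str:
    """Match the query against children, but return the matching parent."""
    _, p_idx = max(index, key=lambda item: len(set(query) & set(item[0])))
    return parents[p_idx]

parents = ["aaaa bbbb cccc", "dddd eeee ffff"]
index = build_index(parents, child_size=5)
context = retrieve_parent("eeee", index, parents)
```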

推荐策略:

Recommended Strategies:

  • 入门:固定大小 + 重叠(300 Token,50 Token 重叠) (Beginner: Fixed-size + overlap (300 Tokens, 50 Token overlap).)

  • 进阶:父子 Chunk (Advanced: Parent-Child Chunking.)

  • 复杂文档:文档感知切分 (Complex documents: Document-aware chunking.)

3.3 Embedding 模型选型

3.3 Embedding Model Selection

Embedding 模型负责把文本转成向量,是语义搜索核心。

The Embedding model is responsible for converting text into vectors and is the core of semantic search.

主要评估维度:

Main Evaluation Dimensions:

  • 语义表征能力(MTEB 榜单) (Semantic representation capability (MTEB leaderboard).)

  • 支持最大 Token 长度 (Maximum supported Token length.)

  • 中文支持 (Chinese language support.)

  • 推理速度与成本 (Inference speed and cost.)

  • 是否支持本地部署 (Whether it supports local deployment.)

主流选型:

Mainstream Options:

模型 (Model) | 部署方式 (Deployment) | 向量维度 (Dim.) | 特点与适用场景 (Features & Applicable Scenarios)
text-embedding-3-large | API | 3072 | OpenAI,英文强,成本低 (OpenAI, strong in English, low cost.)
text-embedding-3-small | API | 1536 | 性价比高,轻量任务首选 (High cost-performance, preferred for lightweight tasks.)
BGE-M3 | 开源/本地 (Open-source/Local) | 1024 | 中英双语强,支持密集+稀疏+多向量 (Strong in Chinese and English, supports dense+sparse+multi-vector.)
BGE-large-zh | 开源/本地 (Open-source/Local) | 1024 | 中文专项优化 (Specifically optimized for Chinese.)
Jina Embeddings v3 | API/本地 (API/Local) | 1024 | 多语言,支持长文本 (Multilingual, supports long text.)
nomic-embed-text | 开源 (Open-source) | 768 | 轻量高效,本地部署友好 (Lightweight and efficient, friendly for local deployment.)
m3e-base/m3e-large | 开源/本地 (Open-source/Local) | 768/1024 | 国产中文专属,效果稳、速度快,社区常用 (Domestic Chinese-specific, stable performance, fast speed, commonly used in the community.)

选型建议:

Selection Advice:

  • 中文场景:优先 BGE-M3 / BGE-large-zh (Chinese scenarios: Prioritize BGE-M3 / BGE-large-zh.)

  • 纯 API 不想自建:text-embedding-3-small (Pure API, don't want to self-host: text-embedding-3-small.)

  • 数据保密要求高:本地部署 BGE 系列 (High data confidentiality requirements: Local deployment of BGE series.)

3.4 向量数据库选型

3.4 Vector Database Selection

向量数据库负责存储向量并高效执行相似度搜索。

The vector database is responsible for storing vectors and efficiently performing similarity searches.

选型前先想清楚:

Think clearly before selection:

  1. 数据量级:百万级以内还是以上? (Data scale: Under a million or above?)

  2. 更新频率:静态库还是实时更新? (Update frequency: Static library or real-time updates?)

  3. 是否需要向量 + 标量过滤混合查询? (Is hybrid querying with vector + scalar filtering needed?)

  4. 有无运维能力? (Do you have operational capabilities?)

  5. 预算是否支持云服务? (Does the budget support cloud services?)

主流向量数据库对比:

Comparison of Mainstream Vector Databases:

数据库 (Database) | 部署方式 (Deployment) | 核心特点 (Core Features) | 适用场景 (Applicable Scenarios)
Milvus | 自建/云服务 (Self-hosted/Cloud) | 功能最全,性能强,企业级 (Most comprehensive features, strong performance, enterprise-grade.) | 大规模生产环境 (Large-scale production environment.)
Weaviate | 自建/云服务 (Self-hosted/Cloud) | GraphQL 接口,模块化 (GraphQL interface, modular.) | 复杂查询、多模态 (Complex queries, multimodal.)
Qdrant | 自建/云服务 (Self-hosted/Cloud) | Rust 编写,高性能,支持过滤 (Written in Rust, high performance, supports filtering.) | 高性能要求,中小规模 (High-performance requirements, small to medium scale.)
Chroma | 本地嵌入 (Local Embedded) | 简单友好,无需独立服务 (Simple and friendly, no independent service needed.) | 原型开发、小规模 (Prototype development, small scale.)
FAISS | 库(非服务) (Library (not a service)) | Meta,高性能,无持久化 (Meta, high performance, no persistence.) | 学习、小项目、自定义封装 (Learning, small projects, custom packaging.)
pgvector | PostgreSQL 扩展 (PostgreSQL Extension) | 无需新技术栈,与 PG 深度集成 (No new tech stack needed, deep integration with PG.) | 已有 PostgreSQL 基础设施 (Existing PostgreSQL infrastructure.)
Pinecone | 全托管云服务 (Fully Managed Cloud) | 零运维,无限扩展 (Zero operations, unlimited scaling.) | 不想运维,快速上线 (Don't want to operate, fast deployment.)
Elasticsearch / OpenSearch | 自建/云服务 (Self-hosted/Cloud) | 成熟生态,全文检索 + 向量检索一体,插件化(k-NN),社区极大 (Mature ecosystem, full-text + vector search integrated, plugin-based (k-NN), huge community.) | 已有 ES/OpenSearch 业务,混合文本 + 向量检索,企业级搜索 (Existing ES/OpenSearch business, hybrid text + vector search, enterprise search.)

选型路径:

Selection Path:

  • 快速验证/个人项目:Chroma / FAISS (Quick validation/personal projects: Chroma / FAISS.)

  • 已有 PG:pgvector (Already have PG: pgvector.)

  • 中小企业自建:Qdrant / Milvus Lite (SME self-hosting: Qdrant / Milvus Lite.)

  • 大规模企业:Milvus + 云部署 (Large-scale enterprise: Milvus + cloud deployment.)

  • 不想运维:Pinecone (Don't want to operate: Pinecone.)
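
To make the core job of any of these databases concrete, the toy in-memory store below does what they all do at heart: keep vectors, score by cosine similarity, return the top-k. The two-dimensional vectors are hand-written stand-ins for real Embedding-model output; production systems use approximate-nearest-neighbor indexes instead of a full sort.

```python
import math

# Minimal in-memory "vector store": cosine similarity + top-k, nothing more.

class MiniVectorStore:
    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, vector))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    def query(self, vector: list[float], top_k: int = 2) -> list[str]:
        ranked = sorted(self._items,
                        key=lambda item: self._cosine(vector, item[1]),
                        reverse=True)
        return [doc_id for doc_id, _ in ranked[:top_k]]

store = MiniVectorStore()
store.add("doc_a", [1.0, 0.0])
store.add("doc_b", [0.0, 1.0])
store.add("doc_c", [0.9, 0.1])
hits = store.query([1.0, 0.1], top_k=2)
```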

3.5 LLM 选型

3.5 LLM Selection

LLM 是“大脑”,RAG 是“外接记忆”。LLM 选型主要看:

The LLM is the “brain,” RAG is the “external memory.” LLM selection mainly depends on:

核心需求 (Core Requirement) | 推荐模型 (Recommended Model)
最强效果,不计成本 (Best performance, regardless of cost.) | GPT-4o / Claude 3.5 Sonnet
效果成本平衡 (Performance-cost balance.) |

常见问题(FAQ)

RAG技术具体解决了大模型的哪些核心问题?

RAG主要解决大模型的知识截止和幻觉问题。它通过检索外部最新知识来弥补模型训练数据的时效性不足,并提供事实依据来减少模型“一本正经胡说八道”的情况。

What core issues of large models does RAG technology specifically address?

RAG primarily tackles the problems of knowledge cutoff and hallucinations in large models. It compensates for the timeliness limitations of model training data by retrieving the latest external knowledge and provides factual grounding to reduce instances of the model "confidently spouting nonsense."

RAG技术从2020年到现在经历了怎样的发展?

RAG技术已演进至第五代。第一代于2020年概念诞生,第二代在2022-2023年确立了基本范式,后续几代持续优化检索精度、生成质量与系统效率,推动技术走向成熟落地。

How has RAG technology evolved from 2020 to the present?

RAG technology has progressed to its fifth generation. The first generation emerged conceptually in 2020, the second established the fundamental paradigm in 2022–2023, and subsequent generations have continuously optimized retrieval accuracy, generation quality, and system efficiency, driving the technology toward maturity and practical implementation.

在企业中实施RAG有哪些关键步骤?

企业实施RAG需遵循完整路径:先理解核心概念与价值,再进行技术选型,参考业界经典实践,最后规划从零到一的部署方案。这能帮助团队避免弯路,真正让大模型在生产环境中创造价值。

What are the key steps for implementing RAG in enterprises?

Implementing RAG in enterprises requires following a comprehensive path: first, understand the core concepts and value; then, proceed with technology selection; refer to industry best practices; and finally, plan a deployment strategy from scratch. This approach helps teams avoid detours and truly enables large models to create value in production environments.
