How Does RAG Address LLM Knowledge Cutoff and Hallucination? Mainstream Architectures in 2026 Explained

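RAG (Retrieval-Augmented Generation) combines information retrieval with text generation: it retrieves relevant documents to strengthen what a large language model (LLM) generates. It is currently one of the mainstream architectures for putting LLMs into production.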

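The core idea of RAG is that, when answering a question, the LLM first retrieves relevant content from an external knowledge base and then generates its answer from the retrieved results, rather than relying solely on knowledge memorized during training.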

This addresses two core pain points of LLMs: the knowledge cutoff (the model does not know about events that happened after training) and hallucination (the model fabricates answers when it is uncertain).

RAG Fundamentals

A complete RAG system consists of two pipelines: an offline indexing pipeline, which preprocesses documents and stores them in a vector database (a database built to store vector embeddings and run high-dimensional semantic similarity search over them), and an online query pipeline, which receives the user's question, retrieves relevant content, and generates an answer.

In the offline stage, raw documents are split into small chunks, converted to vector representations by an embedding model, and stored in the vector database.

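In the online stage, the user's question is converted to a vector with the same embedding model. The system retrieves the most similar chunks from the database, concatenates them with the question into a context, and passes that context to the LLM to generate the final answer.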
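The two pipelines described above can be sketched in a few lines. This is a minimal illustration, assuming a toy bag-of-words `embed` function in place of a real embedding model and an in-memory list in place of a vector database; every function name here is hypothetical and not taken from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline pipeline: embed every chunk and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Online pipeline: embed the question and return the top-k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Concatenate the retrieved chunks and the question into one prompt for the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {question}"


chunks = [
    "HNSW is a graph-based approximate nearest neighbor index.",
    "Chunk overlap keeps sentences from being cut at chunk boundaries.",
    "Cosine similarity measures the angle between two vectors.",
]
index = build_index(chunks)                      # offline: run once per corpus
question = "What does cosine similarity measure?"
top = retrieve(index, question, k=1)             # online: run per query
prompt = build_prompt(question, top)
print(prompt)                                    # this string would be sent to the LLM
```

In a real system, `embed` would call an embedding model and `build_index` would write to a vector database; the control flow stays the same.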

Figure: the complete RAG request flow.

Data Preprocessing and Document Chunking

Prerequisite Challenge: Parsing Complex Documents

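Before any chunking happens, a RAG system usually has to deal with format parsing. Tables, images, and multi-column layouts in PDFs, Word documents, and scanned files are especially troublesome: naive text extraction easily scrambles the meaning.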

The mainstream industry approach today is to bring in a document parsing engine (such as LlamaParse or Unstructured) or a multimodal large model that converts complex text-and-image content into structured Markdown, laying the groundwork for high-quality chunking.

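Document Chunking Strategies

Chunking is the foundation of RAG quality: chunk granularity directly affects retrieval quality. Chunks that are too large introduce noise; chunks that are too small lose context. Common strategies are listed below: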


Chunking Strategy	Applicable Scenarios	Advantages	Disadvantages
Fixed-size chunking	General text	Simple to implement, fast	May cut through a semantically complete sentence
Recursive character chunking	Structured text (Markdown, code)	Splits at semantic boundaries such as paragraphs and sentences first	Slightly more complex; needs a sensible list of separators
Semantic chunking	Long documents, books	Uses embeddings to score the similarity of adjacent sentences and splits at semantic turning points	High compute cost, slow preprocessing
Parent-child document retrieval (Small-to-Big)	Comprehensive-coverage scenarios	Runs high-precision vector retrieval over "small chunks" and, on a hit, returns the corresponding "large chunk" (parent document) to the LLM, balancing retrieval precision and context completeness	Roughly doubles database design and maintenance cost
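The parent-child (Small-to-Big) row in the table can be illustrated with a small sketch. It assumes a toy word-overlap score in place of vector similarity, and the helper names (`split_children`, `build_child_index`, `retrieve_parent`) are hypothetical; the point is only the mapping from a matched child chunk back to its parent.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used as a toy stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def split_children(parent: str, size: int) -> list[str]:
    """Split a parent document into child chunks of `size` words each."""
    words = parent.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_child_index(parents: list[str], child_size: int = 8) -> list[tuple[str, int]]:
    """Index small child chunks; each one remembers the id of its parent."""
    return [
        (child, parent_id)
        for parent_id, parent in enumerate(parents)
        for child in split_children(parent, child_size)
    ]


def retrieve_parent(question: str, child_index: list[tuple[str, int]], parents: list[str]) -> str:
    """Match the question against child chunks, then return the whole parent."""
    q = tokens(question)
    _, parent_id = max(child_index, key=lambda item: len(q & tokens(item[0])))
    return parents[parent_id]


parents = [
    "Refunds are accepted within 30 days of purchase. The item must be unused "
    "and in its original packaging. Shipping costs are not refunded.",
    "Standard delivery takes three to five business days. Express delivery "
    "arrives the next business day for an extra fee.",
]
child_index = build_child_index(parents)
context = retrieve_parent("Is the original packaging required?", child_index, parents)
print(context)
```

The match happens on a short child chunk, but the LLM receives the full parent, which here also carries the 30-day limit and the shipping rule.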

In practice, chunking usually adds overlap: adjacent chunks share a span of text so that important information is not cut off at a boundary. A typical configuration is a chunk size of 512 tokens with an overlap of 50 to 100 tokens.
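To make the overlap arithmetic concrete, here is a minimal sketch of fixed-size chunking over a token list. The function name `chunk_with_overlap` is hypothetical, and the tiny sizes are chosen only so the result is easy to read.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap          # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # the last window already reached the end
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window advances by `chunk_size - overlap`, so with 512 and 50 every new chunk starts 462 tokens after the previous one.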

Example: Recursive Chunking with LangChain

# Recent LangChain releases ship the splitters in the langchain-text-splitters package;
# older releases expose the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder input; replace with the text of your own document.
document_text = "RAG retrieves relevant chunks before generating an answer.\n\n" * 40

# Note: chunk_size is measured in characters by default (length_function=len).
# To count tokens instead, build the splitter with
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # Maximum length of each chunk
    chunk_overlap=50,      # Overlap between adjacent chunks, to avoid losing information at boundaries
    separators=["\n\n", "\n", "。", ".", " ", ""],  # Try paragraphs first, then lines, sentences, words
)

chunks = splitter.split_text(document_text)
print(f"Split into {len(chunks)} document chunks")

Vector Retrieval

Embedding Models

An embedding model converts text into a dense vector (typically an array of 768 or 1536 floating-point numbers). Semantically similar texts lie closer together in the vector space, which is the mathematical basis for similarity search.

Comparison of common embedding models:


Model	Dimensions	Supported Languages	Characteristics
`text-embedding-3-small` (OpenAI)	1536	Multilingual	Cost-effective, suitable for large-scale indexing
`text-embedding-3-large` (OpenAI)	3072	Multilingual	Highest accuracy in this list, higher cost
`BAAI/bge-m3`	1024	Multilingual, including Chinese and English	Open-source, strong Chinese performance
`sentence-transformers/all-MiniLM-L6-v2`	384	English	Small and fast, suitable for very lightweight local deployment

Similarity Metrics and ANN Algorithms

Retrieval comes down to measuring distance. The most common metric is cosine similarity, the cosine of the angle between two vectors: it ranges over [-1, 1], and values closer to 1 mean greater similarity. Dot product and Euclidean (L2) distance are also used.
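The three metrics can be written out directly. This is a plain-Python sketch for illustration (production systems use vectorized libraries); the function names are chosen here and do not come from any specific package.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; ignores vector magnitude."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


a = [1.0, 0.0]
same_direction = [2.0, 0.0]   # longer vector, same direction as a
orthogonal = [0.0, 3.0]
opposite = [-1.0, 0.0]

print(cosine_similarity(a, same_direction))  # direction matches, so the length difference is ignored
print(cosine_similarity(a, orthogonal))
print(cosine_similarity(a, opposite))
print(dot_product(a, same_direction), l2_distance(a, same_direction))
```

When vectors are normalized to unit length, the dot product equals the cosine similarity and the squared L2 distance equals 2 - 2·cos, so all three metrics produce the same ranking.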

To reach millisecond-level retrieval over millions of vectors, databases typically use approximate nearest neighbor (ANN) algorithms such as HNSW and IVF. HNSW (Hierarchical Navigable Small World) is currently the most widely used: it builds a multi-layer graph, similar in spirit to a skip list, and trades a very small loss of accuracy for an order-of-magnitude gain in search speed.
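The accuracy-for-speed trade can be seen in a toy version of IVF, the simpler of the two algorithms named above (this is not HNSW). The sketch assumes randomly sampled centroids instead of the k-means training a real IVF index would use, and the class name `TinyIVF` is hypothetical.

```python
import math
import random


def l2(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force(vectors: list[list[float]], query: list[float], k: int) -> list[int]:
    """Exact search: compare the query against every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]


class TinyIVF:
    """Toy inverted-file index: each vector is bucketed under its nearest centroid."""

    def __init__(self, vectors: list[list[float]], n_lists: int, seed: int = 0):
        self.vectors = vectors
        # Simplification: sample centroids from the data instead of running k-means.
        self.centroids = random.Random(seed).sample(vectors, n_lists)
        self.lists: list[list[int]] = [[] for _ in range(n_lists)]
        for i, v in enumerate(vectors):
            self.lists[self._nearest_centroids(v, 1)[0]].append(i)

    def _nearest_centroids(self, v: list[float], n: int) -> list[int]:
        order = sorted(range(len(self.centroids)), key=lambda c: l2(self.centroids[c], v))
        return order[:n]

    def search(self, query: list[float], k: int, nprobe: int) -> list[int]:
        """Scan only the `nprobe` buckets whose centroids are closest to the query."""
        candidates = [i for c in self._nearest_centroids(query, nprobe) for i in self.lists[c]]
        return sorted(candidates, key=lambda i: l2(self.vectors[i], query))[:k]


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]

index = TinyIVF(vectors, n_lists=10)
exact = brute_force(vectors, query, k=5)
approx = index.search(query, k=5, nprobe=2)    # scans about a fifth of the data
full = index.search(query, k=5, nprobe=10)     # scans every bucket, so it matches exact search
print("recall@5 with nprobe=2:", len(set(exact) & set(approx)) / 5)
```

Raising `nprobe` scans more buckets and moves the result toward the exact answer at the cost of more distance computations; HNSW exposes a similar knob through its search-time candidate list size.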

Advanced RAG

The basic architecture (Naive RAG) often suffers from inaccurate retrieval and from "context flooding", where redundant retrieved text drowns out the relevant parts. Advanced RAG addresses these problems with a three-stage architecture: pre-retrieval optimization → retrieval fusion → post-retrieval optimization.

1. Pre-retrieval: Query Optimization

Users' original questions are often not phrased precisely enough:

Query Rewriting: uses an LLM to rewrite a colloquial question into normalized search terms.

HyDE (Hypothetical Document Embedding): has the LLM first "guess" a hypothetical answer. Because the generated answer usually contains more domain terminology than the original question, using…
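Both techniques are thin wrappers around an LLM call. The sketch below assumes two injected callables, `llm` (prompt in, text out) and `embed` (text in, vector out); these, the prompt wording, and the function names are all hypothetical. The HyDE half follows the technique as it is commonly described: embed the hypothetical answer rather than the raw question.

```python
from typing import Callable, Sequence

REWRITE_PROMPT = (
    "Rewrite the user question as a concise, keyword-rich search query. "
    "Return only the query.\n\nQuestion: {question}"
)
HYDE_PROMPT = (
    "Write a short passage that plausibly answers the question.\n\nQuestion: {question}"
)


def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    """Query rewriting: ask the LLM for a normalized search query."""
    return llm(REWRITE_PROMPT.format(question=question)).strip()


def hyde_vector(
    question: str,
    llm: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """HyDE: embed a hypothetical answer and use that vector for retrieval."""
    hypothetical_answer = llm(HYDE_PROMPT.format(question=question))
    return embed(hypothetical_answer)


# Stubs so the sketch runs without a model; swap in real clients in practice.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Rewrite"):
        return " laptop battery drain troubleshooting \n"
    return "Fast battery drain is often caused by background processes."


def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


search_query = rewrite_query("why does my laptop die so fast", fake_llm)
query_vector = hyde_vector("why does my laptop die so fast", fake_llm, fake_embed)
print(search_query)
print(query_vector)
```

Passing the model and the embedder in as callables keeps the retrieval code independent of any particular provider.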

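Frequently Asked Questions (FAQ)

How exactly does RAG address hallucination in large language models?

RAG retrieves the most recent and most relevant information from an external knowledge base and uses it as the basis for generation. The LLM answers from those facts instead of from internal memory that may be outdated or inaccurate, which effectively reduces fabricated answers.

Why is document chunking important in a RAG system?

Chunking is a key step in the offline indexing pipeline: it turns raw documents into small pieces suited to retrieval. A sound chunking strategy directly affects the precision of vector retrieval and is a precondition for accurately matching user questions later on.

What steps does the online query flow of RAG include?

The online flow converts the user's question into a query vector, retrieves the most similar chunks (Top-K) from the vector database, concatenates the question and the retrieved chunks into a prompt, and finally passes the prompt to the LLM to generate the final answer.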
