
In Legal RAG Systems, Which Matters More for Performance: Information Retrieval or Reasoning? (With Legal RAG Bench Results)

3 April 2026

AI Summary (BLUF)

Legal RAG Bench, a new benchmark for legal RAG systems, reveals that information retrieval, not reasoning, is the primary performance driver. The Kanon 2 Embedder model outperforms competitors by 17 points on average, and most 'hallucinations' are actually triggered by retrieval failures.


Summary

We are releasing Legal RAG Bench, a new reasoning-intensive benchmark and evaluation methodology for assessing the end-to-end, real-world performance of legal RAG systems.


Our evaluation of state-of-the-art embedding and generative models on Legal RAG Bench reveals that information retrieval is the primary driver of legal RAG performance rather than reasoning. We find that the Kanon 2 Embedder legal embedding model, in particular, delivers an average accuracy boost of 17 points relative to other leading models.


We also infer, based on a statistically robust hierarchical error analysis, that most errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures.


We conclude that information retrieval sets the ceiling on the performance of modern legal RAG systems. While strong retrieval can compensate for weak reasoning, strong reasoning often cannot compensate for poor retrieval.


In the interests of transparency, we have openly released Legal RAG Bench on Hugging Face and added it to the Massive Legal Embedding Benchmark (MLEB).


2 March 2026: Legal RAG Bench has since been published as a paper.


The State of Legal RAG Evaluation

In October 2025, we released the Massive Legal Embedding Benchmark (MLEB), the most comprehensive benchmark for legal text embedding models to date. Notably, we found that performance on existing legal retrieval benchmarks did not correlate strongly with performance on MLEB.


We identified leakage of evaluation data into the training sets of commercial embedding models as one potential cause of that mismatch. Another factor was the relatively poor label quality and methodological unsoundness of many public legal evaluation sets.


Consider, for example, the AILA Casedocs and AILA Statutes datasets, which comprise 25% of MTEB’s legal split. We determined that a significant number of query-passage pairs in them were wholly irrelevant to each other, having been created using an automated methodology that paired facts from cases with cases and statutes that had been cited by the lawyers arguing those cases.

AILA CasedocsAILA Statutes数据集为例,它们占MTEB法律类别的25%。我们确定其中大量查询-段落对彼此完全无关,这些数据是通过一种自动化方法创建的,该方法将案件中的事实与律师在辩论该案件时引用的案例和法规配对。

Given a basic understanding of how judgments are written and how legal citations work, it is clear that retrieval based solely on citation text is often impossible and of little practical value.


Such systemic flaws are observable even in some of the most popular LLM benchmarks. Our review of the legal subset of Humanity’s Last Exam (HLE) revealed that most examples were inappropriate, poorly framed, or mislabeled.


Both LegalBench and LegalBench-RAG suffer from a mismatch between marketing and reality. Despite being marketed as stress tests for reasoning and retrieval, the vast majority of their data consists of low-value, trivial text classification tasks.

LegalBenchLegalBench-RAG都存在宣传与实际不符的问题。尽管被宣传为推理和检索的压力测试,但它们的大部分数据实际上由低价值、琐碎的文本分类任务组成。

All the problems highlighted above are emblematic of a fundamental misunderstanding of legal work on the part of AI practitioners, most of whom do not have legal backgrounds.


The consequences for users of open-source legal benchmarks are grave. Users absorb and act on a distorted view of model utility, and model creators are incentivized to cut corners at the expense of genuine performance.


MLEB, Kanon 2 Embedder, and now Legal RAG Bench are intended to break that trend. Informed by deep subject matter expertise in law and AI, Legal RAG Bench offers a fresh approach to evaluating real-world usefulness.


What Makes Legal RAG Bench Different

Legal RAG Bench is both a dataset and an evaluation methodology.


As a dataset, Legal RAG Bench consists of 4,876 passages sampled from the Judicial College of Victoria’s Criminal Charge Book paired with 100 complex, handwritten questions demanding expert-level knowledge of Victorian criminal law and procedure. It represents the first evaluation set to assess RAG systems aimed at providing practical, real-world legal advice in an under-resourced but vital domain: criminal law.


Uniquely, subject matter expertise in law and AI informed every stage of its design and development, from crafting realistic scenarios to selecting the foundational legal text.


As a methodology, Legal RAG Bench constitutes the first full factorial experiment evaluating legal RAG systems, enabling empirical apples-to-apples assessments of the relative impact of retrieval and generative models.


How We Built Legal RAG Bench

In constructing Legal RAG Bench, we downloaded the Criminal Charge Book, converted it to Markdown, and used a complex set of heuristics and the semchunk semantic chunking algorithm to create a corpus of 4,876 passages, with no chunk exceeding 512 tokens.

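The post does not publish its preprocessing code, but the semchunk library it names is open source. Below is a minimal sketch of how such a corpus could be assembled, assuming a hypothetical Markdown export at `charge_book.md`, the `cl100k_base` tiktoken encoding for token counting, and a simple heading-based pre-split standing in for the unpublished heuristics.

```python
# Illustrative corpus construction with semchunk. Assumptions: "charge_book.md"
# is a hypothetical path to a Markdown export of the Criminal Charge Book, and
# the heading pre-split stands in for the post's unpublished heuristics.
import json

import semchunk

MAX_TOKENS = 512  # the post caps every chunk at 512 tokens

# chunkerify accepts a tiktoken encoding name (among other tokenizers) and
# returns a callable that splits text semantically under the token budget.
chunker = semchunk.chunkerify("cl100k_base", chunk_size=MAX_TOKENS)

with open("charge_book.md", encoding="utf-8") as f:
    text = f.read()

# Stand-in heuristic: pre-split on top-level Markdown headings so that chunks
# never straddle section boundaries.
sections = [s for s in text.split("\n# ") if s.strip()]

corpus: list[str] = []
for section in sections:
    corpus.extend(chunker(section))

with open("corpus.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False)
print(f"{len(corpus)} passages")  # the released corpus has 4,876
```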

After building our corpus, we randomly sampled passages and hand-crafted 100 complex questions, each designed so that it could only be answered correctly with the specific passage it was drawn from. Questions were made as lexically dissimilar from their relevant passages as possible to stress test semantic understanding.


Model Performance on Legal RAG Bench

We evaluated three state-of-the-art embedding models and two state-of-the-art generative models based on their popularity and purported legal performance.


To minimize confounding variables, we used the same barebones LangChain-based RAG pipeline for all evaluations, with temperature fixed at zero.

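The exact pipeline is not included in the post, so the following is a minimal sketch of a comparable barebones LangChain setup, with OpenAI-hosted models standing in for the ones benchmarked (the real matrix would swap in the Kanon, Gemini, and GPT integrations). The prompt wording, `k=10`, and model names are our assumptions.

```python
# A barebones LangChain RAG loop of the kind described, temperature fixed at 0.
# Assumptions: OpenAI-hosted stand-in models, our own prompt wording, k=10.
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

corpus = ["...passage 1...", "...passage 2..."]  # the 4,876 Charge Book passages

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
retriever = FAISS.from_texts(corpus, embeddings).as_retriever(
    search_kwargs={"k": 10}
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)  # temperature fixed at zero

def answer(question: str) -> str:
    """Retrieve passages and answer strictly from them."""
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using ONLY the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content
```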

We used GPT-5.2 in high reasoning mode to evaluate three binary metrics for each combination (a sketch of this judging setup follows the list):

  • Correctness: Was the generative model's response correct?
  • Groundedness: Was the response supported by the retrieved passages?
  • Retrieval Accuracy: Was the relevant passage retrieved?

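A sketch of how such a three-metric judge can be wired up is below; the rubric wording and the structured-output schema are ours, with an OpenAI model standing in for GPT-5.2 in high reasoning mode.

```python
# Sketch of an LLM judge scoring the three binary metrics for one RAG output.
# Assumptions: an OpenAI model stands in for GPT-5.2 in high reasoning mode,
# and the rubric wording is ours, not the benchmark's.
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class Verdict(BaseModel):
    correct: bool    # Correctness: was the response correct?
    grounded: bool   # Groundedness: was it supported by the retrieved passages?
    retrieved: bool  # Retrieval accuracy: was the relevant passage retrieved?

judge = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(Verdict)

def score(question: str, reference: str, gold_passage: str,
          retrieved_passages: str, response: str) -> Verdict:
    """Ask the judge for the three binary verdicts on one RAG output."""
    return judge.invoke(
        "Judge this legal RAG output, returning three booleans.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Gold passage: {gold_passage}\n"
        f"Retrieved passages: {retrieved_passages}\n"
        f"Model response: {response}"
    )
```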

The table below reports the average scores for each model combination.


| Embedding Model | Generative Model | Avg. Correctness | Avg. Groundedness | Avg. Retrieval Accuracy |
|---|---|---|---|---|
| Kanon 2 Embedder | Gemini 3.1 Pro | 79.3% | 94.7% | 89.0% |
| Kanon 2 Embedder | GPT-5.2 | 78.0% | 88.7% | 89.0% |
| OpenAI Text Embedding 3 Large | Gemini 3.1 Pro | 61.3% | 90.0% | 55.0% |
| OpenAI Text Embedding 3 Large | GPT-5.2 | 61.3% | 84.7% | 55.0% |
| Gemini Embedding 001 | Gemini 3.1 Pro | 47.3% | 89.3% | 38.0% |
| Gemini Embedding 001 | GPT-5.2 | 47.3% | 83.3% | 38.0% |

These results highlight that end-to-end legal RAG performance is primarily driven by the choice of embedding model, whereas generative models have a mild effect. The Kanon 2 Embedder had the largest positive impact, with an average increase of 17.5 points in correctness and 34 points in retrieval accuracy relative to its next best alternative.


Notably, when using Kanon 2 Embedder, performance remained stable across generative models. It effectively shifts errors into the generative layer, meaning retrieval failures no longer handicap the system.


Who Is to Blame for Legal “Hallucinations”?

Because Legal RAG Bench provides both relevant passages and correct answers, we can attribute errors to the models that triggered them.


Consider an example where GPT-5.2 answers a question correctly. Without knowing the retrieved passages, we might credit the generative model. However, Legal RAG Bench reveals that the embedding model (e.g., Gemini Embedding 001) completely failed to retrieve any relevant passages. GPT-5.2 generated the correct answer from its internal knowledge, which qualifies as a hallucination in our evaluation framework because it was not grounded in the provided sources.

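Under this framework, attribution follows mechanically from the three binary judgments. The decision rule below is our reconstruction of the logic this example describes, not published code; it reuses the hypothetical `Verdict` schema from the judging sketch above.

```python
def attribute(verdict: Verdict) -> str:
    """Attribute an outcome to the layer that triggered it (our reconstruction)."""
    if verdict.correct and verdict.grounded and verdict.retrieved:
        return "success"
    if not verdict.retrieved:
        # The relevant passage never reached the generative model, so even a
        # correct answer is ungrounded: a hallucination triggered upstream.
        return "triggered by retrieval failure"
    return "triggered by generation failure"
```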

We instructed models to answer solely based on retrieved passages because real-world legal RAG applications demand verifiability more than raw correctness. From an end user's perspective, a correct answer without supporting evidence is indistinguishable from an incorrect one.


Switching generative models has a moderate effect on hallucinations, with Gemini 3.1 Pro having an average hallucination rate of 5.7% and GPT-5.2 having a rate of 11.3%.


Generative errors are proportionally higher with Kanon 2 Embedder; however, that is because a dramatic reduction in upstream retrieval errors shifts failures into the generative layer. Instead of poor retrieval handicapping generative models, it is now inferior generative models handicapping high-quality retrieval.


We also measured each model’s deviation from the average RAG accuracy for a more robust comparison. Our results show that Kanon 2 Embedder delivers 18% better overall RAG accuracy than the sample average, while GPT-5.2 and Gemini 3.1 Pro shift accuracy by -3% and +3%, respectively.

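The post does not spell out its aggregation, but the shape of a deviation-from-sample-mean comparison can be reproduced directly from the results table. The snippet below uses average correctness as the accuracy measure, which is our assumption; the post’s “overall RAG accuracy” may aggregate the metrics differently.

```python
# Deviation of each model from the sample-average score, computed from the
# results table above. Using average correctness is our assumption.
results = {  # (embedding model, generative model) -> average correctness (%)
    ("Kanon 2 Embedder", "Gemini 3.1 Pro"): 79.3,
    ("Kanon 2 Embedder", "GPT-5.2"): 78.0,
    ("OpenAI Text Embedding 3 Large", "Gemini 3.1 Pro"): 61.3,
    ("OpenAI Text Embedding 3 Large", "GPT-5.2"): 61.3,
    ("Gemini Embedding 001", "Gemini 3.1 Pro"): 47.3,
    ("Gemini Embedding 001", "GPT-5.2"): 47.3,
}

grand_mean = sum(results.values()) / len(results)

for position in (0, 1):  # 0 = embedding models, 1 = generative models
    for model in sorted({combo[position] for combo in results}):
        scores = [s for combo, s in results.items() if combo[position] == model]
        deviation = sum(scores) / len(scores) - grand_mean
        print(f"{model}: {deviation:+.1f} points vs. sample mean")
```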

We also tested Gemini 3 Pro before its successor was released. Notably, Gemini 3 Pro actually scored slightly higher (80.3% accuracy) than Gemini 3.1 Pro (79.3%), a 1-point difference.



Frequently Asked Questions (FAQ)

What is the main finding of the Legal RAG Bench benchmark?

The benchmark reveals that, in legal RAG systems, information retrieval rather than reasoning is the primary driver of performance, with retrieval quality directly setting the ceiling on system performance.

How does the Kanon 2 Embedder model perform?

Legal RAG Bench上,Kanon 2 Embedder法律嵌入模型平均准确率比其他领先模型高出17个百分点,表现优异。

What typically causes “hallucination” errors in legal RAG systems?

Based on a hierarchical error analysis, most errors attributed to “hallucinations” are in fact triggered by retrieval failures rather than by reasoning flaws in the generative model itself.
