How to Improve the Reasoning Capability of Large Language Models? A 2025 Guide to the Latest Methods and Techniques
This article provides a comprehensive overview of methods to enhance reasoning capabilities in Large Language Models (LLMs), covering prompt engineering techniques like Chain-of-Thought and Tree-of-Thought, architectural improvements such as RAG and neuro-symbolic hybrids, and emerging approaches like latent space reasoning. It also discusses evaluation benchmarks and challenges in achieving reliable, interpretable reasoning for high-stakes applications.
Introduction: The Critical Role of Reasoning in LLMs
These are self-Q&A notes written to organize my own learning. The article primarily references the survey paper: Advancing Reasoning in Large Language Models: Promising Methods and Approaches.
1. Foundational Concepts: Defining "Reasoning"
1.1 Clarifying Terminology: "Inference" vs. "Reasoning"
In the field of LLMs, the Chinese word 「推理」 corresponds to two different English terms. One is "Inference," also called "Testing," the counterpart of "Training": running a trained model to produce outputs. The other is "Reasoning," the Sherlock Holmes kind of thinking: logically deducing or synthesizing known information to derive new knowledge or conclusions. Reasoning is the core of human intellectual activity and an essential ingredient of many high-value applications such as medical diagnosis, legal decision-making, and scientific research. If an LLM can only "copy and paste" or roughly "piece together" answers from large-scale statistics, it will struggle to genuinely assist with complex human decision-making or innovative work.
Whenever "reasoning" is used in this article, it refers to "Reasoning" in this second sense.
1.2 The Distinction Between Reasoning and Memorization
When using LLMs, people often find that "the model seems to know many facts, yet frequently makes basic errors." This is typically related to how LLMs represent knowledge internally: they mostly memorize correlations between words through massive numbers of parameters. For tasks that require multi-step logical deduction or strict calculation, however, information retrieved purely on the basis of correlation is often insufficient. The model may produce an answer that looks correct on the surface but is off in its deeper logic or numerical accuracy. In academia this phenomenon is commonly called "Hallucination": the model generates content that appears plausible but contradicts the facts.
Therefore, building on top of large-scale memorization, designing additional mechanisms that strengthen logical reasoning (such as prompt engineering, external tool invocation, and integration with symbolic methods) has become a research hotspot in recent years. Next, we discuss several directions and introduce a variety of ideas and methods for improving LLM reasoning capabilities.
1.3 Major Categories of Reasoning
- Deductive Reasoning: Deriving specific conclusions from general principles. If the premises are true, the conclusion must be true. For example, mathematical theorem proving.
- Inductive Reasoning: Inferring general patterns from specific examples. Many machine learning processes are essentially inductive reasoning, such as learning regression or classification models from training samples.
- Abductive Reasoning: Seeking the "most likely" explanation under incomplete or partial information. This is very common in scenarios such as medical diagnosis or fault troubleshooting.
- Commonsense Reasoning: Relying on everyday world knowledge and logic, enabling the model to make judgments based on background common sense like a human. For example, "waiting for a bus outdoors in winter requires wearing warm clothes."
- Probabilistic Reasoning: Introducing uncertainty and performing inference with probability distributions or probabilistic graphical models. For example, risk assessment and financial forecasting often require reasoning results with probabilistic meaning.
2. Paradigm One: Prompt Engineering
Without altering the model architecture or conducting additional large-scale training, improving the input prompts can, to some extent, stimulate or guide an LLM to produce better reasoning results. Several common methods are introduced below.
2.1 Chain-of-Thought (CoT)
Chain-of-Thought (CoT) is a simple yet effective prompt engineering strategy that encourages the model to explicitly write out the intermediate steps of its thinking or derivation while answering. Instead of directly outputting a conclusion, CoT has the LLM generate a "reasoning chain" step by step, starting from the problem and decomposing it in stages until the answer is reached. For example:
Question: There are 12 eggs in a basket. 3 are broken, and 5 are cooked. How many are left?
Thought: First, 12 - 3 = 9 intact eggs. Then, cooking 5 doesn't mean removing them, so 9 are left.
Answer: 9
Experiments have shown that CoT can significantly reduce error rates in tasks requiring multi-step analysis, such as mathematics and logic, and it makes the model's answers more interpretable. However, CoT does not guarantee that every step is correct: if the model lacks sufficient knowledge in a domain, or if the prompt is poorly designed, it can still "think and write incorrectly" midway.
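As a small code illustration (mine, not from the referenced survey), the sketch below assembles a few-shot CoT prompt in Python, reusing the egg example above as the demonstration; the resulting string would then be sent to whatever LLM API is actually in use.
# Minimal sketch of few-shot Chain-of-Thought prompting.
COT_EXAMPLE = (
    "Question: There are 12 eggs in a basket. 3 are broken, and 5 are cooked. How many are left?\n"
    "Thought: First, 12 - 3 = 9 intact eggs. Cooking 5 doesn't mean removing them, so 9 are left.\n"
    "Answer: 9\n"
)

def build_cot_prompt(question: str) -> str:
    # Prepend the worked example and invite step-by-step reasoning
    # before the final answer.
    return (
        COT_EXAMPLE
        + "\nQuestion: " + question
        + "\nThought: Let's think step by step."
    )

# The resulting string is what gets sent to the LLM.
print(build_cot_prompt("A train travels 60 km in 45 minutes. What is its average speed in km/h?"))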
2.2 Self-Consistency: Multi-Threaded Thinking
To avoid the randomness and local errors of a single thought chain, the Self-Consistency method has the model generate multiple thought chains for the same problem and then "votes" over or "aggregates" their conclusions, selecting the one that appears most frequently or receives the most support. The underlying assumption is that if the LLM arrives at the same answer in several independent reasoning attempts, that answer is likely correct. Self-Consistency can be viewed as a "think multiple times, then take a vote" mechanism. For example:
Question: 28 + 47 = ?
Solution A: 20 + 40 = 60, 8 + 7 = 15, total 75
Solution B: 28 + 40 = 68, plus 7 equals 75
Solution C: 30 + 47 = 77, minus 2 equals 75
Based on the vote, the final answer is 75.
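The aggregation step is straightforward to implement. The sketch below is an illustration under assumptions, not code from the survey: it samples several reasoning chains at a non-zero temperature and majority-votes on the extracted final answers. Here sample_chain is a hypothetical stand-in for an LLM call that returns one chain of thought ending in a line such as "Answer: 75".
import re
from collections import Counter
from typing import Optional

def sample_chain(question: str, temperature: float = 0.8) -> str:
    # Hypothetical placeholder: returns one sampled chain of thought
    # that ends with a line such as "Answer: 75".
    raise NotImplementedError

def extract_answer(chain: str) -> Optional[str]:
    # Pull the final answer out of a reasoning chain.
    match = re.search(r"Answer:\s*(.+)", chain)
    return match.group(1).strip() if match else None

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    # Sample several independent chains and return the most frequent answer.
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain(question))
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0]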
2.3 Tree-of-Thought (ToT)
Compared with CoT, which is typically a single linear reasoning chain, Tree-of-Thought (ToT) allows the model to branch into multiple possibilities at a given step and then expand and prune the branches through evaluation or search strategies. A reasoning process thus no longer follows a single path but forms a "reasoning tree," which helps the model keep trying different approaches and select the best one when facing complex decision-making or search problems.
Illustration with a game:
- Goal: Within 4 steps, through a series of operations, turn the number into 24.
- Initial state: The number 1.
- Operations: Only addition or multiplication is allowed, and each step may only add or multiply by 2 or 3.
The ToT for this game is constructed roughly as follows:
                 1
            /  |  |  \
           3   4  2   3          (1+2, 1+3, 1*2, 1*3)
         / | | \
        5  6  6  9               (expanding 3: 3+2, 3+3, 3*2, 3*3)
           / | | \
          8  9 12 18             (expanding 6: 6+2, 6+3, 6*2, 6*3)
                |
         ...   24                (12*2 = 24, goal reached)
- Decomposition: Break the problem down into a series of decision steps (choosing which number to add or multiply at each step).
- Expansion: At each step, based on the current state, expand all possible next states (generate child nodes).
- Evaluation: Evaluate the potential of each state (score the nodes). For example, at the first level, 3 seems more promising for approaching 24 (3 × 8 = 24; although we cannot multiply by 8 directly, it marks a promising route), so 3 receives a higher score.
- Selection: Based on the evaluation results, select the most promising state for the next expansion (select a node).
- Search: By repeatedly expanding, evaluating, and selecting, eventually find the target solution (see the sketch below).
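To make this loop concrete, here is a minimal, self-contained sketch (my illustration, not code from the survey) of a ToT-style beam search on the "reach 24" game above. In a real ToT system, the expansion and evaluation steps would be delegated to the LLM; here a simple distance-to-target heuristic stands in for the evaluation.
OPS = [("+2", lambda x: x + 2), ("+3", lambda x: x + 3),
       ("*2", lambda x: x * 2), ("*3", lambda x: x * 3)]

def score(value, target=24):
    # "Evaluation" step: states closer to the target score higher.
    # In a real ToT system an LLM prompt would do this scoring.
    return -abs(target - value)

def tree_of_thought(start=1, target=24, max_depth=4, beam_width=3):
    # Each frontier entry is (current value, list of operations taken).
    frontier = [(start, [])]
    for _ in range(max_depth):
        # Expansion: generate every child of every state in the frontier.
        children = [(op(value), path + [name])
                    for value, path in frontier
                    for name, op in OPS]
        # Goal check.
        for value, path in children:
            if value == target:
                return path
        # Evaluation + selection: keep only the most promising states.
        children.sort(key=lambda node: score(node[0]), reverse=True)
        frontier = children[:beam_width]
    return None

print(tree_of_thought())  # e.g. ['+3', '*3', '*2']: 1 -> 4 -> 12 -> 24
The beam width controls how many branches survive each round; a full ToT system would also let the LLM propose which operations are worth trying rather than enumerating all of them.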
2.4 Program-Aided Language Models (PAL)
The PAL method allows the language model to invoke additional computational resources or external programs during reasoning, such as executing Python scripts for complex numerical calculations or calling mathematical software for formula derivation. In this way, the LLM does not have to "simulate" every mathematical operation inside its own parameters; instead, it delegates the key steps to more reliable tools, which improves the system's accuracy and verifiability to some extent.
Example: Solving a Physics Problem
Suppose we have the following physics problem:
An object with mass m = 2 kg moves on a horizontal surface with an initial velocity v0 = 5 m/s, subject to a frictional force f = 0.5 N opposing its motion. How long does it take for the object to stop?
The steps to solve this problem with the PAL method are as follows:
Step 1: The LLM understands the problem and generates code:
- The LLM first understands the problem and identifies that physics formulas are needed for the solution.
- It then generates a piece of Python code to calculate the time it takes for the object to stop.
import sympy
# Define the symbolic variables
m, v0, f, t = sympy.symbols('m v0 f t')
# Newton's second law: the friction force equals mass times acceleration, so a = -f / m
a = -f / m
# Uniformly accelerated motion: final velocity v = v0 + a*t
v = v0 + a * t
# The object stops when v = 0; set up the equation to solve for t
equation = sympy.Eq(v, 0)
# Substitute the given values of m, v0, and f into the equation
equation = equation.subs({m: 2, v0: 5, f: 0.5})
# Solve the equation
t_value = sympy.solve(equation, t)[0]
# Print the result
print(t_value)
Step 2: Execute the code:
- PAL sends the Python code generated by the LLM to a Python interpreter for execution.
- The Python interpreter uses the sympy library for symbolic computation, solves the equation, and obtains the time t_value = 20.
Step 3: The LLM organizes the answer:
- The LLM receives the calculation result from the Python program (t = 20).
- It then organizes the result into a natural-language answer:
The object stops moving after 20 seconds.
3. Paradigm Two: Improving Model/System Architecture
This category of methods focuses on improving the internal construction or working mechanisms of models. The core idea is to modify, extend, or restructure the model architecture itself so that, during reasoning, it gains new information channels, knowledge representations, or logical deduction pathways, thereby improving both reasoning capability and interpretability.
3.1 Retrieval-Augmented Generation (RAG): Adding Knowledge Bases and Retrieval Modules
Strictly speaking, RAG does not alter the structure of the "base model." We nevertheless place it in this paradigm because here we treat the entire AI system as a whole, not just the "base model."
The idea of RAG is to have the language model first retrieve information relevant to the question from an external knowledge base (such as a vector database or a text corpus) before answering, and then concatenate the retrieved material with the original question as the input for generation. In this way, the LLM does not have to rely entirely on its internal implicit memory; it can ground its reasoning more firmly in "facts" and reduce the risk of fabricating answers.
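As a rough illustration of this pipeline (mine, not from the source article), the sketch below retrieves the top-k most similar documents from a tiny in-memory store and prepends them to the prompt. The embed and generate functions are hypothetical stand-ins for an embedding model and an LLM call.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder for a sentence-embedding model.
    raise NotImplementedError

def generate(prompt: str) -> str:
    # Hypothetical placeholder for an LLM completion call.
    raise NotImplementedError

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    # Rank documents by cosine similarity between embeddings.
    q = embed(query)
    def cosine(doc: str) -> float:
        v = embed(doc)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(documents, key=cosine, reverse=True)[:k]

def rag_answer(question: str, documents: list[str]) -> str:
    # Concatenate the retrieved passages with the question and generate.
    context = "\n".join(retrieve(question, documents))
    prompt = ("Answer the question using only the context below.\n"
              "Context:\n" + context + "\n\nQuestion: " + question + "\nAnswer:")
    return generate(prompt)
A production system would replace the brute-force similarity loop with a vector index (for example FAISS or a vector database) and typically adds document chunking and reranking on top of this basic flow.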