
How Does DeepSeek-V3.2 Surpass GPT-5? Three Technical Breakthroughs of 2026, Explained

2026/3/8
AI Summary (BLUF)

DeepSeek-V3.2 introduces three key innovations: DeepSeek Sparse Attention (DSA) for computational efficiency, a scalable reinforcement learning framework achieving GPT-5 parity, and a large-scale agentic task synthesis pipeline. The high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and matches Gemini-3.0-Pro, demonstrating gold-medal performance in the 2025 IMO and IOI competitions.


Abstract


We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

1. Introduction


The release of reasoning models marked a pivotal moment in the evolution of Large Language Models (LLMs), catalyzing a substantial leap in overall performance across verifiable domains. Since this milestone, the capabilities of LLMs have advanced rapidly. However, a distinct divergence has emerged in recent months: while the open-source community continues to make strides, the performance trajectory of closed-source proprietary models has accelerated at a significantly steeper rate. Consequently, rather than converging, the performance gap between closed-source and open-source models appears to be widening, with proprietary systems demonstrating increasingly superior capabilities on complex tasks.

Through our analysis, we identify three critical deficiencies that limit the capability of open-source models on complex tasks. First, architecturally, the predominant reliance on vanilla attention mechanisms severely constrains efficiency on long sequences. This inefficiency poses a substantial obstacle to both scalable deployment and effective post-training. Second, regarding resource allocation, open-source models suffer from insufficient computational investment during the post-training phase, limiting their performance on hard tasks. Finally, in the context of AI agents, open-source models demonstrate a marked lag in generalization and instruction-following capabilities compared to their proprietary counterparts, hindering their effectiveness in real-world deployment.

To address these critical limitations, we first introduce DSA, a highly efficient attention mechanism designed to substantially reduce computational complexity. This architecture effectively addresses the efficiency bottleneck, preserving model performance even in long-context scenarios. Second, we develop a stable and scalable RL protocol that allows for significant computational expansion during the post-training phase. Notably, this framework allocates a post-training computational budget exceeding 10% of the pre-training cost, unlocking advanced capabilities. Third, we propose a novel pipeline to foster generalizable reasoning in tool-use scenarios. We begin with a cold-start phase utilizing the DeepSeek-V3 methodology to unify reasoning and tool use within single trajectories. We then advance to large-scale agentic task synthesis, generating over 1,800 distinct environments and 85,000 complex prompts. This extensive synthesized data drives the RL process, significantly enhancing the model's generalization and instruction-following capability in agentic contexts.

DeepSeek-V3.2 achieves performance similar to that of Kimi-k2-thinking and GPT-5 across multiple reasoning benchmarks. Furthermore, it significantly advances the agentic capabilities of open models, demonstrating exceptional proficiency on long-tail agent tasks such as those introduced by EvalSys (2025). DeepSeek-V3.2 thus emerges as a highly cost-efficient alternative in agent scenarios, significantly narrowing the performance gap between open and frontier proprietary models while incurring substantially lower costs. Notably, to push the boundaries of open models in the reasoning domain, we relaxed the length constraints to develop DeepSeek-V3.2-Speciale. As a result, DeepSeek-V3.2-Speciale achieves performance parity with the leading closed-source system, Gemini-3.0-Pro, and shows gold-medal performance in IOI 2025, the ICPC World Finals 2025, IMO 2025, and CMO 2025.

2. DeepSeek-V3.2 Architecture

2.1. DeepSeek Sparse Attention (DSA)


DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp. Compared with DeepSeek-V3.1-Terminus, the final release of the DeepSeek-V3.1 series, the only architectural modification in DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA) through continued training.

Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism. The lightning indexer computes an index score I_{t,s} between the query token h_t and each preceding token h_s, determining which tokens should be selected for the query token. Given the index scores {I_{t,s}} for each query token h_t, the fine-grained token selection mechanism retrieves only the key-value entries {c_s} corresponding to the top-k index scores. The attention output u_t is then computed by applying attention between the query token h_t and the sparsely selected key-value entries {c_s}.
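The indexer-then-select mechanism can be sketched in a few lines of NumPy. This is a hypothetical toy (single head, pre-computed index scores passed in as a matrix), not DeepSeek's actual kernels:

```python
import numpy as np

def dsa_prototype(h, c, index_scores, k):
    """Toy DSA forward pass: for each query token t, keep only the k
    preceding key-value entries c_s with the highest index scores I_{t,s},
    then apply ordinary softmax attention over that sparse set."""
    L, d = h.shape
    out = np.zeros_like(c)
    for t in range(L):
        scores = index_scores[t, : t + 1]             # causal: only s <= t
        top = np.argsort(scores)[-min(k, t + 1):]     # fine-grained top-k selection
        logits = c[top] @ h[t] / np.sqrt(d)           # attention over k entries only
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[t] = w @ c[top]
    return out
```

Setting k ≥ L makes the loop degenerate to dense causal attention, which is a convenient correctness check for the sparse path.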

Instantiating DSA Under MLA. To allow continued training from DeepSeek-V3.1-Terminus, we instantiate DSA based on MLA for DeepSeek-V3.2. At the kernel level, each key-value entry must be shared across multiple queries for computational efficiency. Therefore, we implement DSA based on the MQA mode of MLA, where each latent vector (the key-value entry of MLA) is shared across all query heads of the query token. The DSA architecture based on MLA is illustrated in Figure 2.
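The MQA-mode sharing can be illustrated as follows; shapes and names here are assumptions for exposition, not the actual DeepSeek implementation:

```python
import numpy as np

def mqa_mode_sparse_attention(q_heads, latents, selected):
    """Sketch of MQA-mode sharing: all query heads of one token attend to
    the SAME sparsely selected latent vectors (MLA's key-value entries),
    so the top-k gather from the latent cache happens once per token
    rather than once per head."""
    kv = latents[selected]                                # (k, d): one shared gather
    d = q_heads.shape[1]
    logits = q_heads @ kv.T / np.sqrt(d)                  # (n_heads, k)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv                                         # (n_heads, d)
```

Sharing the gathered latents across heads is what makes the sparse selection pay off at the kernel level: the memory traffic for the key-value cache is amortized over all query heads of the token.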

2.1.1. Continued Pre-Training


Starting from a base checkpoint of DeepSeek-V3.1-Terminus, whose context length had already been extended to 128K, we perform continued pre-training followed by post-training to create DeepSeek-V3.2. The continued pre-training of DeepSeek-V3.2 consists of two training stages. For both stages, the distribution of the training data is fully aligned with the 128K long-context extension data used for DeepSeek-V3.1-Terminus.

Dense Warm-up Stage. We first use a short warm-up stage to initialize the lightning indexer. In this stage, we keep dense attention and freeze all model parameters except those of the lightning indexer. To align the indexer outputs with the main attention distribution, for the t-th query token we first aggregate the main attention scores by summing across all attention heads. This sum is then L1-normalized along the sequence dimension to produce a target distribution p_{t,:}. Based on p_{t,:}, we use a KL-divergence loss as the indexer's training objective. For the warm-up, we use a learning rate of 10^-3. We train the indexer for only 1000 steps, with each step consisting of 16 sequences of 128K tokens, for a total of 2.1B tokens.
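The warm-up objective can be written out concretely. A minimal sketch, omitting causal masking and the indexer's actual parameterization (which the text does not specify):

```python
import numpy as np

def indexer_warmup_loss(attn_probs, indexer_logits, eps=1e-12):
    """Build the target p_{t,:} by summing the main attention probabilities
    over heads and L1-normalizing along the sequence axis, then return the
    mean per-token KL(p || q) against the indexer's softmax distribution.
    attn_probs: (n_heads, L, L); indexer_logits: (L, L)."""
    p = attn_probs.sum(axis=0)                            # sum across heads
    p /= p.sum(axis=-1, keepdims=True)                    # L1 normalization -> p_{t,:}
    q = np.exp(indexer_logits - indexer_logits.max(axis=-1, keepdims=True))
    q /= q.sum(axis=-1, keepdims=True)                    # indexer distribution
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())
```

The loss is zero exactly when the indexer's distribution matches the head-aggregated target, which is the alignment the warm-up stage is after.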

Sparse Training Stage. Following the indexer warm-up, we introduce the fine-grained token selection mechanism and optimize all model parameters to adapt the model to the sparse pattern of DSA. In this stage, we continue to align the indexer outputs with the main attention distribution, but consider only the selected token set S_t. Notably, we detach the indexer input from the computational graph for separate optimization: the indexer's training signal comes only from the indexer loss L_I, while the main model is optimized only with the language-modeling loss. In this sparse training stage, we use a learning rate of 7.3 × 10^-6 and select 2048 key-value tokens for each query token. We train both the main model and the indexer for 15000 steps, with each step consisting of 480 sequences of 128K tokens, for a total of 943.7B tokens.
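The token budgets quoted for the two stages follow directly from steps × sequences per step × sequence length, taking 128K = 131,072 tokens:

```python
# Sanity-check the token totals quoted above (128K = 131,072 tokens/sequence).
SEQ_LEN = 128 * 1024

warmup_tokens = 1000 * 16 * SEQ_LEN    # dense warm-up: 1000 steps x 16 sequences
sparse_tokens = 15000 * 480 * SEQ_LEN  # sparse stage: 15000 steps x 480 sequences

print(f"warm-up: {warmup_tokens / 1e9:.1f}B tokens")  # 2.1B
print(f"sparse:  {sparse_tokens / 1e9:.1f}B tokens")  # 943.7B
```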

2.2. Performance Evaluation


Standard Benchmarks. In September 2025, we evaluated DeepSeek-V3.2-Exp on a suite of benchmarks covering diverse capabilities and compared it with DeepSeek-V3.1-Terminus, which performs similarly. While DeepSeek-V3.2-Exp significantly improves computational efficiency on long sequences, we observe no substantial performance degradation relative to DeepSeek-V3.1-Terminus on either short- or long-context tasks.

Human Preference. Given that direct human preference assessments are inherently susceptible to bias, we employ ChatbotArena as an indirect evaluation framework to approximate user preferences for the newly developed base models. Both DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp share an identical post-training strategy, and their Elo scores, obtained from evaluations conducted on 10 November 2025, are closely matched. These results suggest that the new base model achieves performance on par with the previous iteration, despite incorporating a sparse attention mechanism.

Long-Context Evaluation. Following the release of DeepSeek-V3.2-Exp, several independent long-context evaluations were conducted on previously unseen test sets. A representative benchmark is AA-LCR, on which DeepSeek-V3.2-Exp scores four points higher than DeepSeek-V3.1-Terminus in reasoning mode. In the Fiction.liveBench evaluation, DeepSeek-V3.2-Exp consistently outperforms DeepSeek-V3.1-Terminus across multiple metrics. This evidence indicates that the base checkpoint of DeepSeek-V3.2-Exp does not regress on long-context tasks.

2.3. Inference Cost

DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k, the number of selected key-value tokens per query (2048 in DeepSeek-V3.2), is much smaller than the sequence length L.
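At the settings used in continued pre-training (128K context, 2048 selected tokens per query), this asymptotic saving amounts to a sizable constant factor:

```python
# Cost ratio of dense attention O(L^2) vs. DSA O(L*k) at the paper's settings.
L = 128 * 1024  # context length (131,072 tokens)
k = 2048        # key-value tokens selected per query

print(f"dense/DSA core-attention cost ratio: {L * L // (L * k)}x")  # 64x
```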
