
Was DeepSeek Distilled from GPT? A 2025 Analysis of Knowledge Distillation Techniques

AI Summary (BLUF)

Knowledge distillation is a model training technique where a smaller student model learns from a larger teacher model, improving efficiency while maintaining performance. This article analyzes whether DeepSeek models were distilled from GPT, examining data, logits, and feature distillation methods.

Introduction

Figure 1: DeepSeek Distilled Small Models

On January 20, 2025, DeepSeek officially released DeepSeek-R1 and simultaneously open-sourced its model weights. Since then, DeepSeek-R1 has surged ahead, rapidly becoming the world's most prominent large language model. It successfully surpassed ChatGPT and claimed the top spot on the U.S. Apple App Store's free app charts!

Figure 2: DeepSeek Tops the U.S. Download Charts

In a previous article, "DeepSeek-R1 Technical Report Analysis," we discussed how DeepSeek-R1 achieved "insight" through large-scale reinforcement learning. Beyond reinforcement learning, distillation is another noteworthy highlight. Through model distillation, DeepSeek-R1-Distill-Qwen-7B achieved 55.5% accuracy on AIME 2024, surpassing QwQ-32B-Preview. Meanwhile, DeepSeek-R1-Distill-Qwen-32B scored 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results are significantly better than previous open-source models and comparable to o1-mini.

A highly discussed question on Zhihu concerns exactly this distillation: many people claim DeepSeek was distilled from GPT. Is this true? Next, we will walk through DeepSeek's distillation techniques in detail; after reading, you should be able to answer this question easily.

What Is Knowledge Distillation?

According to Wikipedia, knowledge distillation is a model training technique in artificial intelligence. It uses a teacher-student paradigm, enabling a smaller, structurally simpler AI model to learn the knowledge possessed by a larger, more complex model that has already been fully trained. Because it allows the small, simple model to quickly and effectively learn what the large, complex model acquired through lengthy training, it improves efficiency and reduces computational cost; hence it is also known as model distillation.

Knowledge distillation is not a new technique. As early as 2006, Bucilua et al. first proposed transferring knowledge from large models to small ones. In 2015, Hinton et al. formally introduced the now widely recognized concept of knowledge distillation. The core idea is that a student model achieves accuracy comparable to the teacher model by imitating it; the key question is how to transfer the teacher model's knowledge to the student model.


Figure 3: Knowledge Distillation Teacher-Student Architecture

Currently, the three most commonly used distillation methods are data distillation, logits distillation, and feature distillation. We introduce each of them below.

Data Distillation

In the data distillation process, the teacher model first generates <question, answer> pairs, which are then used to train the student model. For example, DeepSeek-R1-Distill-Qwen-32B was obtained by directly performing supervised fine-tuning (SFT) on the Qwen2.5-32B base model with 800,000 samples generated by DeepSeek-R1; a minimal sketch of this kind of pipeline follows Figure 4.


Figure 4: Data Distillation
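As a rough illustration of this pipeline, the sketch below collects teacher answers and writes them out as a chat-style SFT dataset. The query_teacher stub, the JSONL format, and the file name are illustrative assumptions, not DeepSeek's actual recipe; the point is simply that data distillation needs nothing from the teacher beyond its generated text.

```python
import json

def query_teacher(question: str) -> str:
    """Hypothetical wrapper around the teacher model's inference API.
    Replace this stub with a real call to the large (teacher) model."""
    return f"[teacher's answer to: {question}]"  # placeholder output

def build_sft_dataset(questions: list[str],
                      out_path: str = "distill_sft.jsonl") -> None:
    """Data distillation: collect <question, answer> pairs from the teacher
    and write them as a chat-style JSONL file for supervised fine-tuning."""
    with open(out_path, "w", encoding="utf-8") as f:
        for q in questions:
            record = {
                "messages": [
                    {"role": "user", "content": q},
                    {"role": "assistant", "content": query_teacher(q)},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# The resulting file is then used to fine-tune (SFT) the smaller student
# model, e.g. a Qwen2.5-32B base model; no logits or hidden states are needed.
build_sft_dataset(["Prove that the sum of two even numbers is even."])
```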

Logits Distillation

Logits are the raw output scores a neural network produces before the softmax function is applied. In logits distillation, the student model is trained to mimic the teacher's logits rather than only its final predictions, which preserves more of the information contained in the teacher model.


Figure 5: Logits Distillation

In logits distillation, the teacher model's output at each step is a probability distribution over the vocabulary; for example, after the softmax the teacher's output might look like [0.7, 0.2, 0.1, 0, 0, 0, 0, 0, 0, 0]. To help the student learn this knowledge more effectively, a temperature coefficient T is usually introduced. The temperature T is a key hyperparameter that controls how soft the teacher's output distribution (its soft labels) is; a larger T spreads probability mass over more tokens. Specifically, the teacher's temperature-scaled output is:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

Here z_i are the teacher model's logits. When training the student model, the same temperature scaling is applied to the student's outputs, and a loss function then measures the difference between the two distributions. A commonly used choice is the Kullback-Leibler (KL) divergence, which gives the distillation loss:

L_KD = T^2 * KL(q || p)

Here q and p are the teacher's and the student's temperature-softened output distributions (both computed with the same temperature T), and the factor T^2 keeps the magnitude of the gradients roughly independent of the chosen temperature. By minimizing this loss, the student model gradually learns to match the teacher's output distribution and thereby acquires its knowledge.
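As a concrete illustration, here is a minimal PyTorch-style sketch of this distillation term. The helper name kd_loss, the tensor shapes, and the T = 2.0 default are illustrative assumptions rather than code from any particular model's training recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            T: float = 2.0) -> torch.Tensor:
    """Temperature-scaled distillation loss L_KD = T^2 * KL(q || p),
    where q and p are teacher and student softmax outputs at temperature T.
    Shapes are assumed to be [batch, vocab_size]."""
    # Soft targets from the teacher: q_i = exp(z_i / T) / sum_j exp(z_j / T)
    q = F.softmax(teacher_logits / T, dim=-1)
    # Student distribution in log space, using the same temperature
    log_p = F.log_softmax(student_logits / T, dim=-1)
    # KL(q || p), averaged over the batch, rescaled by T^2 so the gradient
    # magnitude stays comparable across different temperatures
    return F.kl_div(log_p, q, reduction="batchmean") * (T ** 2)

# Tiny example: a teacher whose (T = 1) output distribution is the
# [0.7, 0.2, 0.1] example from the text, restricted to three tokens.
teacher_logits = torch.log(torch.tensor([[0.7, 0.2, 0.1]]))
student_logits = torch.randn(1, 3)
print(kd_loss(student_logits, teacher_logits, T=2.0))
```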

In addition to the distillation loss, the student model also needs to learn from the ground-truth labels, so the total loss is usually a weighted sum of the cross-entropy loss and the distillation loss:

L_total = α * L_CE + (1 - α) * L_KD

Here α is a hyperparameter that balances the cross-entropy loss against the distillation loss. Trained this way, the student model not only absorbs knowledge from the teacher but also keeps a good fit to the true labels, which is what lets it perform well in practice.
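Continuing the sketch above and reusing the hypothetical kd_loss helper, a combined training objective might look like the following; the value of α, the model names, and the commented training loop are placeholders.

```python
import torch
import torch.nn.functional as F

alpha, T = 0.5, 2.0  # illustrative weighting and temperature

def total_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               labels: torch.Tensor) -> torch.Tensor:
    """L_total = alpha * L_CE + (1 - alpha) * L_KD."""
    ce = F.cross_entropy(student_logits, labels)       # hard-label loss
    kd = kd_loss(student_logits, teacher_logits, T=T)  # soft-label loss (sketch above)
    return alpha * ce + (1 - alpha) * kd

# Inside a training loop the teacher is frozen and only the student updates:
#   with torch.no_grad():
#       teacher_logits = teacher(batch_inputs)
#   student_logits = student(batch_inputs)
#   loss = total_loss(student_logits, teacher_logits, batch_labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```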


Figure 6: Logits Distillation Process

Feature Distillation

If logits distillation still feels insufficient, there is a further method called feature distillation, which distills the intermediate layers of the teacher model.


Figure 7: Feature Distillation

Feature-based knowledge from the intermediate layers is an effective complement to logits knowledge and is particularly well suited to training thinner and deeper student networks. The figure above illustrates the feature-based distillation process, whose loss function can be written as:

L_feat = || f_t(x) - f_s(x) ||^2

where f_t(x) and f_s(x) are intermediate-layer features of the teacher and student for the same input x; when their dimensions differ, the student features are usually passed through a small learned projection before the comparison.
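Below is a minimal PyTorch-style sketch of such a feature loss. The hidden sizes (1024 for the teacher, 512 for the student) and the linear projection layer are illustrative assumptions, not a specific published recipe.

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Feature distillation loss: match a student intermediate feature to the
    teacher's, here as || f_t(x) - proj(f_s(x)) ||^2 with a learned projection.
    Hidden sizes (teacher 1024, student 512) are illustrative only."""

    def __init__(self, student_dim: int = 512, teacher_dim: int = 1024):
        super().__init__()
        # Projection so the narrower student features can be compared
        # with the wider teacher features
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_feat: torch.Tensor,
                teacher_feat: torch.Tensor) -> torch.Tensor:
        # The teacher is frozen, so its features are detached from the graph
        return self.mse(self.proj(student_feat), teacher_feat.detach())

# Usage with features taken from one intermediate layer of each model:
loss_fn = FeatureDistillLoss()
student_feat = torch.randn(4, 512)   # [batch, student_hidden]
teacher_feat = torch.randn(4, 1024)  # [batch, teacher_hidden]
print(loss_fn(student_feat, teacher_feat))
```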

Did DeepSeek Distill GPT?

Finally, let's answer the original question: did DeepSeek distill GPT? We analyze the three methods one by one.

Is it Feature Distillation?

Clearly not. OpenAI's GPT models are not open-source, so we cannot access their model parameters, let alone the intermediate-layer features that feature distillation requires.

Is it Logits Distillation?

Some models do expose logit (or token-probability) distributions alongside their final outputs, but OpenAI's API does not provide the full distributions that logits distillation requires, so it is not logits distillation either.

Is it Data Distillation?

This is perhaps the most debatable point. After all, there were rumors that DeepSeek-R1, when asked who it was, would sometimes reply "I am OpenAI." At the same time, techniques do exist for extracting o1's hidden chain-of-thought, for example the SJTU work "OpenAI o1 Reproduction: Can Simple Distillation Surpass o1?". So I am genuinely uncertain whether DeepSeek performed data distillation on the o1 model.

Summary and References

As an efficient technique for model compression and knowledge transfer, knowledge distillation plays a key role in deploying and applying large models efficiently. DeepSeek's success, whether the "insight" achieved through reinforcement learning or the high-performing small models obtained through distillation, offers valuable reference points for technical practice.

References:

  • Knowledge Distillation: A Survey
  • DeepSeek-R1 Technical Report Analysis (DeepSeek-R1 技术报告解读)
  • Can the Distillation Technique Used by DeepSeek Be Used for "Theft"? (DeepSeek使用的蒸馏技术有可能用来"偷窃"吗?)
  • Must-Read: Understanding "Knowledge Distillation" in One Article (一文读懂 “知识蒸馏”必看!)
  • Understanding Knowledge Distillation in One Article (一文读懂知识蒸馏技术)
  • Finally Understood Knowledge Distillation in Deep Learning (终于把深度学习中的知识蒸馏搞懂了!!)
  • Knowledge Distillation Explained: From Soft Labels to Model Compression (知识蒸馏技术原理详解:从软标签到模型压缩的实现机制)
  • 4000 Words: An In-Depth Analysis of DeepSeek's Distillation Technology (4000字!深度解析 DeepSeek 的蒸馏技术)
  • Revisiting Four Paradigms for Enhancing LLM Reasoning and the Implementation of the Distillation Fine-Tuning Paradigm (再看增强大模型推理能力的四种范式及蒸馏微调范式具体实现)
  • Illustrated Deep Learning: Data Distillation and Knowledge Distillation (图解深度学习 - 数据蒸馏知识蒸馏)


Edited on 2025-09-27 · All copyrights belong to the author.

