
Was DeepSeek Distilled from GPT? A 2025 Analysis of Knowledge Distillation Techniques

AI Summary (BLUF)

Knowledge distillation is a model training technique where a smaller student model learns from a larger teacher model, improving efficiency while maintaining performance. This article analyzes whether DeepSeek models were distilled from GPT, examining data, logits, and feature distillation methods.

Introduction

Figure 1: DeepSeek Distilled Small Models

On January 20, 2025, DeepSeek officially released DeepSeek-R1 and simultaneously open-sourced its model weights. Since then, DeepSeek-R1 has surged ahead, rapidly becoming the world's most prominent large language model. It successfully surpassed ChatGPT and claimed the top spot on the U.S. Apple App Store's free app charts!

Figure 2: DeepSeek Tops the U.S. Download Charts

In a previous article, "DeepSeek-R1 Technical Report Analysis," we discussed how DeepSeek-R1 achieved "insight" through large-scale reinforcement learning. Beyond reinforcement learning, distillation is another noteworthy highlight. Through model distillation, DeepSeek-R1-Distill-Qwen-7B achieved 55.5% accuracy on AIME 2024, surpassing QwQ-32B-Preview. Meanwhile, DeepSeek-R1-Distill-Qwen-32B scored 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results are significantly better than previous open-source models and comparable to o1-mini.

A highly discussed question on Zhihu concerns exactly this distillation: many people claim DeepSeek was distilled from GPT. Is this true? Next, we will walk through DeepSeek's distillation techniques in detail; after reading, you should be able to answer this question easily.

What Is Knowledge Distillation?

According to Wikipedia, knowledge distillation is a model training technique in artificial intelligence. It uses a teacher-student paradigm, enabling a smaller, structurally simpler AI model to learn the knowledge possessed by a larger, more complex model that has already been fully trained. Because it allows the small, simple model to quickly and effectively learn what the large, complex model acquired through lengthy training, it improves efficiency and reduces computational cost; hence it is also known as model distillation.

Knowledge distillation is not a new technique. As early as 2006, Bucilua et al. first proposed transferring knowledge from large models to small ones. In 2015, Hinton et al. formally introduced the now widely recognized concept of knowledge distillation. The core idea is that a student model achieves accuracy comparable to the teacher model by imitating it; the key question is how to transfer the teacher model's knowledge to the student model.


Figure 3: Knowledge Distillation Teacher-Student Architecture

Currently, the three most commonly used distillation methods are data distillation, logits distillation, and feature distillation. We introduce each of them below.

Data Distillation

In the data distillation process, the teacher model first generates <question, answer> pairs, which are then used to train the student model. For example, DeepSeek-R1-Distill-Qwen-32B was obtained by directly performing supervised fine-tuning (SFT) on the Qwen2.5-32B base model with 800,000 samples generated by DeepSeek-R1; a minimal sketch of this kind of pipeline follows Figure 4.


Figure 4: Data Distillation
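As a rough illustration of this pipeline, the sketch below collects teacher answers and writes them out as a chat-style SFT dataset. The query_teacher stub, the JSONL format, and the file name are illustrative assumptions, not DeepSeek's actual recipe; the point is simply that data distillation needs nothing from the teacher beyond its generated text.

```python
import json

def query_teacher(question: str) -> str:
    """Hypothetical wrapper around the teacher model's inference API.
    Replace this stub with a real call to the large (teacher) model."""
    return f"[teacher's answer to: {question}]"  # placeholder output

def build_sft_dataset(questions: list[str],
                      out_path: str = "distill_sft.jsonl") -> None:
    """Data distillation: collect <question, answer> pairs from the teacher
    and write them as a chat-style JSONL file for supervised fine-tuning."""
    with open(out_path, "w", encoding="utf-8") as f:
        for q in questions:
            record = {
                "messages": [
                    {"role": "user", "content": q},
                    {"role": "assistant", "content": query_teacher(q)},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# The resulting file is then used to fine-tune (SFT) the smaller student
# model, e.g. a Qwen2.5-32B base model; no logits or hidden states are needed.
build_sft_dataset(["Prove that the sum of two even numbers is even."])
```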

Logits Distillation

Logits are the raw output scores a neural network produces before the softmax function is applied. In logits distillation, the student model is trained to mimic the teacher's logits rather than only its final predictions, which preserves more of the information contained in the teacher model.


Figure 5: Logits Distillation

In logits distillation, the teacher model's output at each step is a probability distribution over the vocabulary; for example, after the softmax the teacher's output might look like [0.7, 0.2, 0.1, 0, 0, 0, 0, 0, 0, 0]. To help the student learn this knowledge more effectively, a temperature coefficient T is usually introduced. The temperature T is a key hyperparameter that controls how soft the teacher's output distribution (its soft labels) is; a larger T spreads probability mass over more tokens. Specifically, the teacher's temperature-scaled output is:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

Here z_i are the teacher model's logits. When training the student model, the same temperature scaling is applied to the student's outputs, and a loss function then measures the difference between the two distributions. A commonly used choice is the Kullback-Leibler (KL) divergence, which gives the distillation loss:

L_KD = T^2 * KL(q || p)

Here q and p are the teacher's and the student's temperature-softened output distributions (both computed with the same temperature T), and the factor T^2 keeps the magnitude of the gradients roughly independent of the chosen temperature. By minimizing this loss, the student model gradually learns to match the teacher's output distribution and thereby acquires its knowledge.
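As a concrete illustration, here is a minimal PyTorch-style sketch of this distillation term. The helper name kd_loss, the tensor shapes, and the T = 2.0 default are illustrative assumptions rather than code from any particular model's training recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            T: float = 2.0) -> torch.Tensor:
    """Temperature-scaled distillation loss L_KD = T^2 * KL(q || p),
    where q and p are teacher and student softmax outputs at temperature T.
    Shapes are assumed to be [batch, vocab_size]."""
    # Soft targets from the teacher: q_i = exp(z_i / T) / sum_j exp(z_j / T)
    q = F.softmax(teacher_logits / T, dim=-1)
    # Student distribution in log space, using the same temperature
    log_p = F.log_softmax(student_logits / T, dim=-1)
    # KL(q || p), averaged over the batch, rescaled by T^2 so the gradient
    # magnitude stays comparable across different temperatures
    return F.kl_div(log_p, q, reduction="batchmean") * (T ** 2)

# Tiny example: a teacher whose (T = 1) output distribution is the
# [0.7, 0.2, 0.1] example from the text, restricted to three tokens.
teacher_logits = torch.log(torch.tensor([[0.7, 0.2, 0.1]]))
student_logits = torch.randn(1, 3)
print(kd_loss(student_logits, teacher_logits, T=2.0))
```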

In addition to the distillation loss, the student model also needs to learn from the ground-truth labels, so the total loss is usually a weighted sum of the cross-entropy loss and the distillation loss:

L_total = α * L_CE + (1 - α) * L_KD

Here α is a hyperparameter that balances the cross-entropy loss against the distillation loss. Trained this way, the student model not only absorbs knowledge from the teacher but also keeps a good fit to the true labels, which is what lets it perform well in practice.
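Continuing the sketch above and reusing the hypothetical kd_loss helper, a combined training objective might look like the following; the value of α, the model names, and the commented training loop are placeholders.

```python
import torch
import torch.nn.functional as F

alpha, T = 0.5, 2.0  # illustrative weighting and temperature

def total_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               labels: torch.Tensor) -> torch.Tensor:
    """L_total = alpha * L_CE + (1 - alpha) * L_KD."""
    ce = F.cross_entropy(student_logits, labels)       # hard-label loss
    kd = kd_loss(student_logits, teacher_logits, T=T)  # soft-label loss (sketch above)
    return alpha * ce + (1 - alpha) * kd

# Inside a training loop the teacher is frozen and only the student updates:
#   with torch.no_grad():
#       teacher_logits = teacher(batch_inputs)
#   student_logits = student(batch_inputs)
#   loss = total_loss(student_logits, teacher_logits, batch_labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```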


Figure 6: Logits Distillation Process

Feature Distillation

If logits distillation still feels insufficient, there is a further method called feature distillation, which distills the intermediate layers of the teacher model.


Figure 7: Feature Distillation

Feature-based knowledge from the intermediate layers is an effective complement to logits knowledge and is particularly well suited to training thinner and deeper student networks. The figure above illustrates the feature-based distillation process, whose loss function can be written as:

L_feat = || f_t(x) - f_s(x) ||^2

where f_t(x) and f_s(x) are intermediate-layer features of the teacher and student for the same input x; when their dimensions differ, the student features are usually passed through a small learned projection before the comparison.
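Below is a minimal PyTorch-style sketch of such a feature loss. The hidden sizes (1024 for the teacher, 512 for the student) and the linear projection layer are illustrative assumptions, not a specific published recipe.

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Feature distillation loss: match a student intermediate feature to the
    teacher's, here as || f_t(x) - proj(f_s(x)) ||^2 with a learned projection.
    Hidden sizes (teacher 1024, student 512) are illustrative only."""

    def __init__(self, student_dim: int = 512, teacher_dim: int = 1024):
        super().__init__()
        # Projection so the narrower student features can be compared
        # with the wider teacher features
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_feat: torch.Tensor,
                teacher_feat: torch.Tensor) -> torch.Tensor:
        # The teacher is frozen, so its features are detached from the graph
        return self.mse(self.proj(student_feat), teacher_feat.detach())

# Usage with features taken from one intermediate layer of each model:
loss_fn = FeatureDistillLoss()
student_feat = torch.randn(4, 512)   # [batch, student_hidden]
teacher_feat = torch.randn(4, 1024)  # [batch, teacher_hidden]
print(loss_fn(student_feat, teacher_feat))
```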

Did DeepSeek Distill GPT?

Finally, let's answer the original question: did DeepSeek distill GPT? We analyze the three methods one by one.

Is it Feature Distillation?

Clearly not. OpenAI's GPT models are not open-source, so we cannot access their model parameters, let alone the intermediate-layer features that feature distillation requires.

Is it Logits Distillation?

Some models do expose logit (or token-probability) distributions alongside their final outputs, but OpenAI's API does not provide the full distributions that logits distillation requires, so it is not logits distillation either.

Is it Data Distillation?

This is perhaps the most debatable point. After all, there were rumors that DeepSeek-R1, when asked who it was, would sometimes reply "I am OpenAI." At the same time, techniques do exist for extracting o1's hidden chain-of-thought, for example the SJTU work "OpenAI o1 Reproduction: Can Simple Distillation Surpass o1?". So I am genuinely uncertain whether DeepSeek performed data distillation on the o1 model.

Summary and References

As an efficient technique for model compression and knowledge transfer, knowledge distillation plays a key role in deploying and applying large models efficiently. DeepSeek's success, whether the "insight" achieved through reinforcement learning or the high-performing small models obtained through distillation, offers valuable reference points for technical practice.

References:

  • Knowledge Distillation: A Survey
  • DeepSeek-R1 Technical Report Analysis (DeepSeek-R1 技术报告解读)
  • Can the Distillation Technique Used by DeepSeek Be Used for "Theft"? (DeepSeek使用的蒸馏技术有可能用来"偷窃"吗?)
  • Must-Read: Understanding "Knowledge Distillation" in One Article (一文读懂 “知识蒸馏”必看!)
  • Understanding Knowledge Distillation in One Article (一文读懂知识蒸馏技术)
  • Finally Understood Knowledge Distillation in Deep Learning (终于把深度学习中的知识蒸馏搞懂了!!)
  • Knowledge Distillation Explained: From Soft Labels to Model Compression (知识蒸馏技术原理详解:从软标签到模型压缩的实现机制)
  • 4000 Words: An In-Depth Analysis of DeepSeek's Distillation Technology (4000字!深度解析 DeepSeek 的蒸馏技术)
  • Revisiting Four Paradigms for Enhancing LLM Reasoning and the Implementation of the Distillation Fine-Tuning Paradigm (再看增强大模型推理能力的四种范式及蒸馏微调范式具体实现)
  • Illustrated Deep Learning: Data Distillation and Knowledge Distillation (图解深度学习 - 数据蒸馏知识蒸馏)


Edited on 2025-09-27 · All copyrights belong to the author.

