
How Does DeepSeek's Open-Source DeepGEMM Matrix Computation Library Perform on Hopper GPUs? (Measured: 1350+ FP8 TFLOPS)

2026/4/21

AI Summary (BLUF)

DeepGEMM is a high-performance matrix multiplication library optimized for NVIDIA Hopper GPUs, achieving over 1350 FP8 TFLOPS. It supports standard and Mixture-of-Experts (MoE) computations with just about 300 lines of core code.

Introduction

Here it comes again: our "Source God" DeepSeek has delivered once more.

On the third day of the DeepSeek Open Source Week, the team introduced DeepGEMM, a matrix multiplication library specifically optimized for Hopper architecture GPUs. This library supports both standard matrix computations and Mixture-of-Experts (MoE) model calculations, providing robust support for the training and inference of DeepSeek-V3/R1, achieving high performance of 1350+ FP8 TFLOPS on Hopper GPUs.

The design philosophy of DeepGEMM is simplicity and efficiency, with its core code comprising only about 300 lines, while outperforming existing solutions for most matrix sizes. The library supports three data layout modes: a standard layout and two special layouts designed for Mixture-of-Experts models. DeepGEMM employs Just-In-Time (JIT) compilation technology, eliminating the need for compilation during installation. Its code structure is clear and easy to understand, making it an excellent resource for learning GPU optimization techniques.

Core Concepts Explained

What Are FP8 and GEMM?

In computing, numerical values are stored using binary bits. The storage format determines precision and required space. Traditionally, AI computations use 32-bit floating-point numbers, which offer high precision but consume significant storage space and computational resources.

Research indicates that many AI tasks do not actually require such high precision. 16-bit floating-point numbers have been widely adopted, and 8-bit floating-point represents a further reduction in precision. Although FP8 offers lower precision, it is sufficient for many AI tasks while significantly reducing memory usage and increasing computational speed. This is akin to measuring a large object with a coarser scale: while precision is reduced, the speed is much faster, and it remains sufficiently accurate for most scenarios.

GEMM (General Matrix Multiply) is the most fundamental and common computational operation in deep learning. Simply put, it calculates the product of two data matrices. While this seems straightforward, in AI computations, these matrices can be enormous, containing millions of elements, making matrix multiplication one of the most time-consuming parts of the entire system. Essentially, the computation of almost all neural network layers involves matrix multiplication operations.
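To make the FP8 trade-off concrete, here is a small NumPy sketch (an illustration only, not DeepGEMM's kernel) that quantizes both operands to an E4M3-like value grid with a per-tensor scale, multiplies, and rescales the output. The `quantize_fp8_e4m3` helper is a crude stand-in for real FP8 and is not bit-exact:

```python
import numpy as np

def quantize_fp8_e4m3(x):
    """Crude FP8 E4M3 stand-in: scale into the format's range, then keep
    roughly 3 mantissa bits by rounding at an exponent-dependent step."""
    amax = float(np.abs(x).max())
    scale = 448.0 / amax                      # 448 = max normal E4M3 value
    xq = x * scale
    exp = np.floor(np.log2(np.maximum(np.abs(xq), 1e-12)))
    step = 2.0 ** (exp - 3)                   # 3 explicit mantissa bits
    xq = np.round(xq / step) * step
    return xq.astype(np.float32), np.float32(1.0 / scale)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)

aq, sa = quantize_fp8_e4m3(a)
bq, sb = quantize_fp8_e4m3(b)
c_fp8 = (aq @ bq) * (sa * sb)   # GEMM on quantized operands, then rescale
c_ref = a @ b                   # full-precision reference
rel_err = float(np.abs(c_fp8 - c_ref).max() / np.abs(c_ref).max())
```

Even though each stored value keeps only about four significant bits, the rescaled product stays within a small relative error of the float32 result, which is why FP8 is acceptable for many AI workloads.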

DeepGEMM specifically optimizes FP8-precision matrix multiplication while addressing potential precision issues that may arise when the Hopper architecture handles FP8 computations, ensuring accurate and reliable calculation results.
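DeepSeek's materials describe this fix as promoting partial sums into a higher-precision accumulator at regular intervals (two-level accumulation). The NumPy sketch below illustrates the numerics, using float16 as a stand-in for the low-precision accumulator; the block length and formats here are illustrative assumptions, not DeepGEMM's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16384
a16 = rng.uniform(0.0, 1.0, K).astype(np.float16)
b16 = rng.uniform(0.0, 1.0, K).astype(np.float16)
exact = float(np.dot(a16.astype(np.float64), b16.astype(np.float64)))

# Naive: one low-precision running sum for the whole dot product. Once the
# sum grows large, float16's coarse spacing swallows the small increments.
acc_low = np.float16(0.0)
for i in range(K):
    acc_low = np.float16(acc_low + a16[i] * b16[i])

# Two-level: accumulate short blocks in low precision, then "promote" each
# partial sum into a float32 accumulator.
BLOCK = 128
acc_promoted = np.float32(0.0)
for start in range(0, K, BLOCK):
    part = np.float16(0.0)
    for i in range(start, start + BLOCK):
        part = np.float16(part + a16[i] * b16[i])
    acc_promoted = acc_promoted + np.float32(part)

err_naive = abs(float(acc_low) - exact)
err_promoted = abs(float(acc_promoted) - exact)
```

In the naive loop the running sum eventually stalls: once it passes ~2048, float16's spacing exceeds every per-element product, so later terms are lost entirely. Periodic promotion to a wider accumulator sidesteps this.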

Standard Matrix Multiplication vs. Mixture-of-Experts Computation

Standard matrix multiplication deals with operations between complete matrices and is suitable for traditional neural network architectures where all data is processed uniformly.

In contrast, the Mixture-of-Experts (MoE) model is a special neural network architecture that consists of multiple "expert" networks and a "gating" network. The gating network is responsible for deciding which experts to assign the input data to for processing, rather than having all data pass through all experts. This approach allows the model size to grow significantly while maintaining computational efficiency, as only a portion of the model is activated for each processing step, not the entire model.

For MoE models, DeepGEMM provides two special data layout modes:

  • Contiguous layout: suitable for training and batch inference; concatenates the data handled by different experts into a single contiguous block;

  • Masked layout: suitable for real-time inference; uses masks to indicate which data needs processing, pairing especially well with CUDA Graph technology.
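The two layouts can be mimicked in a few lines of NumPy; this is a toy model of the data movement, not the library's actual API, and `assign` stands in for the gating network's routing decision:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, n_experts = 16, 8, 4, 3
x = rng.standard_normal((n_tokens, d_in)).astype(np.float32)
w = rng.standard_normal((n_experts, d_in, d_out)).astype(np.float32)
assign = rng.integers(0, n_experts, size=n_tokens)  # gating decision (toy)

# Contiguous layout: sort tokens by expert so each expert's inputs form one
# contiguous block, run one GEMM per block, then scatter back.
order = np.argsort(assign, kind="stable")
x_sorted = x[order]
out_sorted = np.empty((n_tokens, d_out), dtype=np.float32)
start = 0
for e in range(n_experts):
    cnt = int((assign == e).sum())
    out_sorted[start:start + cnt] = x_sorted[start:start + cnt] @ w[e]
    start += cnt
out_contig = np.empty_like(out_sorted)
out_contig[order] = out_sorted   # restore original token order

# Masked layout: tokens stay in place; each expert's GEMM only touches the
# rows its boolean mask marks (fixed shapes, which suits CUDA Graphs).
out_masked = np.empty((n_tokens, d_out), dtype=np.float32)
for e in range(n_experts):
    mask = assign == e
    out_masked[mask] = x[mask] @ w[e]

# Per-token reference: route each token through its own expert.
out_ref = np.stack([x[i] @ w[assign[i]] for i in range(n_tokens)])
```

Both layouts compute the same result; they differ only in how the per-expert work is packed for the GPU.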

Performance

DeepGEMM demonstrates excellent performance across various computational scenarios. For standard matrix multiplication, compared to optimized implementations based on CUTLASS 3.6, speed improvements range from 1.0x to 2.7x. The most significant acceleration is achieved in small-batch data processing, with a maximum speedup of 2.7x.

For Mixture-of-Experts model computations, the two special data layout modes provided by DeepGEMM also show clear advantages. The contiguous layout, suitable for training and batch inference stages, offers speed improvements of approximately 1.1x to 1.2x. The masked layout, designed specifically for real-time inference and supporting integration with CUDA Graph technology, similarly achieves speedups of 1.1x to 1.2x.

Considering that the official data may not be easily readable, we have reorganized the core performance comparison data as follows:

Standard Matrix Multiplication Performance Comparison (FP8, Hopper H100)

The table below shows the performance comparison (TFLOPS) between DeepGEMM and CUTLASS 3.6 on standard FP8 matrix multiplication. Higher values are better, with bold indicating DeepGEMM leads.

| Matrix Size (M×N×K) | Batch Size | DeepGEMM (TFLOPS) | CUTLASS 3.6 (TFLOPS) | Speedup |
|---------------------|------------|-------------------|----------------------|----------|
| 256×7168×2816       | 1          | ~1350             | ~500                 | ~2.7x    |
| 256×7168×2816       | 16         | ~1350             | ~1250                | ~1.08x   |
| 4096×4096×4096      | 1          | ~1300             | ~1280                | ~1.02x   |
| Small sizes (avg.)  | 1-16       | clearly ahead     | -                    | up to 2.7x |

Mixture-of-Experts Computation Performance Comparison

The table below compares the performance of the two special layouts provided by DeepGEMM for MoE models against baseline implementations.

| Computation | Data Layout | Use Case                   | DeepGEMM Speedup | Key Technique                     |
|-------------|-------------|----------------------------|------------------|-----------------------------------|
| MoE         | Contiguous  | Training / batch inference | 1.1x - 1.2x      | Contiguous data blocks            |
| MoE         | Masked      | Real-time inference        | 1.1x - 1.2x      | Mask markers, CUDA Graph friendly |

Architecture and Technical Depth

Hopper GPUs and Tensor Cores

NVIDIA's Hopper GPU is the latest hardware platform specifically designed for artificial intelligence and high-performance computing, offering several key technological improvements:

Tensor Cores are specialized computational units within the GPU, optimized specifically for matrix operations, capable of significantly accelerating deep learning computations. The Tensor Cores in the Hopper architecture support FP8 computation, providing higher performance than previous generations.

TMA (Tensor Memory Accelerator) is a new feature introduced in the Hopper architecture for faster, asynchronous data movement. DeepGEMM fully leverages TMA technology for loading and storing data and utilizes advanced features like TMA multicast and descriptor prefetching to further enhance performance.

Just-In-Time Compilation

Just-In-Time (JIT) compilation is a technique where programs are compiled at runtime, rather than being pre-compiled during installation or deployment. DeepGEMM adopts a fully JIT-compiled design, where all computational kernels are compiled during actual runtime. This offers several advantages:

  • Matrix shapes, block sizes, etc., can be treated as compile-time constants, saving computational resources and enabling more compiler optimizations;

  • The library automatically selects the optimal parameter configuration for each task, without manual tuning;

  • Computation pipelines are fully unrolled, giving the compiler more room to optimize, which particularly benefits small matrices.

This JIT compilation approach significantly improves computational performance for small matrix shapes, following a technical philosophy similar to modern compilers like Triton.
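DeepGEMM's kernels are CUDA C++ compiled at runtime, but the shape-specialization idea is language-agnostic. The hypothetical pure-Python sketch below bakes the matrix shape into generated source as literals and caches one compiled function per shape, mirroring how a JIT can treat shapes as compile-time constants:

```python
# Hypothetical illustration: generate and cache one specialized kernel per
# (m, n, k) shape, with the shape embedded as compile-time literals.
_kernel_cache = {}

def get_matmul_kernel(m, n, k):
    key = (m, n, k)
    if key not in _kernel_cache:
        # Shapes become literals in the generated source, so loop bounds
        # are constants; DeepGEMM does the same with NVCC and CUDA C++.
        src = f"""
def kernel(a, b):
    out = [[0.0] * {n} for _ in range({m})]
    for i in range({m}):
        for j in range({n}):
            acc = 0.0
            for p in range({k}):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out
"""
        ns = {}
        exec(compile(src, f"<kernel_{m}x{n}x{k}>", "exec"), ns)
        _kernel_cache[key] = ns["kernel"]
    return _kernel_cache[key]

kernel = get_matmul_kernel(2, 3, 2)   # compiled on first request, then cached
result = kernel([[1, 2], [3, 4]], [[5, 6, 7], [8, 9, 10]])
```

The first call pays the compilation cost; every later call with the same shape reuses the cached, shape-specialized function.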

CUDA and CUTLASS

CUDA is a parallel computing platform and programming model developed by NVIDIA, allowing developers to leverage the powerful parallel processing capabilities of GPUs. It is the foundational tool for writing GPU programs.

CUTLASS is NVIDIA's open-source matrix multiplication library, providing high-performance matrix computation templates. DeepGEMM draws inspiration from some concepts in CUTLASS but does not directly depend on its complex template system. Instead, it implements its own, more concise codebase, ensuring performance while remaining easy to understand and learn from.

Thread Specialization

DeepGEMM employs thread specialization technology, an efficient method of task division. In this design, different computational threads are assigned specific responsibilities: some handle data movement, some handle core computations, and some handle result processing.

This division of labor allows data movement, computation, and post-processing to occur simultaneously, forming an efficient pipeline that significantly improves overall performance.
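As a rough analogy (Python threads rather than GPU warps, and all names here are illustrative), the sketch below splits work between a "loader" thread that stages operand blocks and a "compute" thread that multiplies them, overlapping the two stages through a bounded queue:

```python
import queue
import threading

import numpy as np

def run_pipeline(blocks_a, blocks_b):
    """Loader thread stages data (stand-in for TMA loads); compute thread
    does the math (stand-in for tensor-core warps). The bounded queue lets
    the two stages overlap, producer-consumer style."""
    staged = queue.Queue(maxsize=2)
    results = []

    def loader():
        for a, b in zip(blocks_a, blocks_b):
            staged.put((a, b))      # "copy into shared memory"
        staged.put(None)            # sentinel: no more work

    def compute():
        while (item := staged.get()) is not None:
            a, b = item
            results.append(a @ b)   # the "math" stage

    threads = [threading.Thread(target=loader), threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

outs = run_pipeline(
    [np.eye(2, dtype=np.float32), 2 * np.eye(2, dtype=np.float32)],
    [np.ones((2, 2), dtype=np.float32)] * 2,
)
```

On a GPU the payoff is that data movement for block N+1 proceeds while block N is being multiplied, keeping both the memory units and the tensor cores busy.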

Key Technical Innovations

DeepGEMM incorporates several advanced technological innovations:

Non-Standard Block Sizes

Traditionally, GPU computations often use standard-sized data blocks. DeepGEMM supports non-standard block sizes, which can better adapt to specific matrix shapes and improve hardware resource utilization. For example, for a matrix of M=256, N=7168, standard block sizes can only utilize 112 computational units, whereas using non-standard block sizes can utilize 128 units, resulting in a noticeable efficiency improvement.
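The arithmetic behind that example is easy to check. Assuming (for illustration) a power-of-two 128×128 output tile versus an unaligned 128×112 one, counting launched tiles for the article's M=256, N=7168 case reproduces the 112 vs. 128 figures:

```python
import math

def blocks_used(m, n, block_m, block_n):
    """Number of output tiles a GEMM launches for an (m x n) result."""
    return math.ceil(m / block_m) * math.ceil(n / block_n)

# M=256, N=7168 (the article's example); the tile sizes are assumptions.
standard  = blocks_used(256, 7168, 128, 128)  # 2 * 56 = 112 tiles
unaligned = blocks_used(256, 7168, 128, 112)  # 2 * 64 = 128 tiles
```

An H100 has on the order of 132 SMs, so 128 concurrent tiles keep far more of the chip busy in a single wave than 112 do, which is the utilization gain the text describes.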

Instruction-Level Optimization

By analyzing machine code generated by different compiler versions, the DeepGEMM team discovered and implemented special instruction scheduling optimizations. This low-level optimization adjusts the execution order of computational instructions, enabling computational units to work in parallel more efficiently and significantly boosting FP8 computation performance.

Unified Scheduling System

DeepGEMM has designed a unified computational task scheduling system that employs special arrangement strategies to enhance cache reuse efficiency, reduce memory access, and improve overall performance.
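One common scheduling trick of this kind (shown here as a hypothetical sketch, not DeepGEMM's exact policy) is to rasterize output tiles in column groups rather than plain row-major order, so that consecutively scheduled tiles share operand slices and hit in cache:

```python
def raster_order(num_m_blocks, num_n_blocks, group_n=4):
    """Visit output tiles in column groups of width `group_n`: within each
    group, sweep all M-blocks so the same few slices of B stay hot in
    cache, instead of streaming all of B for every row of tiles."""
    order = []
    for n0 in range(0, num_n_blocks, group_n):
        for m in range(num_m_blocks):
            for n in range(n0, min(n0 + group_n, num_n_blocks)):
                order.append((m, n))
    return order

schedule = raster_order(2, 8)  # a 2x8 grid of output tiles
```

The visit order changes nothing about the math; it only changes which operand data is re-read from cache versus from memory.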

Usage Guide

System Requirements

Using DeepGEMM requires meeting the following software and hardware requirements:

| Component | Minimum                      | Recommended | Notes                          |
|-----------|------------------------------|-------------|--------------------------------|
| GPU       | Hopper architecture (sm_90a) | NVIDIA H100 | Must support FP8 Tensor Cores  |
| Python    | 3.8                          | 3.10+       | -                              |
| CUDA      | 12.3                         | 12.4+       | -                              |
| PyTorch   | 2.1                          | 2.3+        | -                              |
| CUTLASS   | 3.6                          | 3.6+        | Used as reference implementation |

Installation and Development

Development

# Submodule must be cloned
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git
# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop
# Test JIT compilation
python tests/test_jit.py
# Test all GEMM implementations (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py

Installation

python setup.py install

Finally, simply import deep_gemm.

API Overview

DeepGEMM provides a clear Python programming interface, including:

  • Standard matrix multiplication functions: for conventional neural network computations;

  • Contiguous layout grouped functions: for Mixture-of-Experts model training and batch inference;

  • Masked layout grouped functions: for MoE real-time inference, designed to pair with CUDA Graphs.

FAQ

What are DeepGEMM's core advantages over other matrix computation libraries?

DeepGEMM is optimized specifically for NVIDIA Hopper GPUs, with a core of only about 300 lines of code. Through JIT compilation and thread specialization it outperforms existing solutions for most matrix sizes, and it supports both standard and Mixture-of-Experts computations.

How does DeepGEMM handle precision issues in FP8 computation?

DeepGEMM optimizes FP8-precision matrix multiplication while addressing the precision issues that can arise when the Hopper architecture performs FP8 computation, keeping results accurate and reliable as speed improves.

What special support does DeepGEMM provide for Mixture-of-Experts computation?

DeepGEMM supports two special data layouts designed for MoE models, optimizing the computation flow in which the gating network routes inputs to a subset of experts, improving the efficiency of MoE training and inference.
