
How Does DeepSeek's Open-Source DeepGEMM Matrix Computation Library Perform on Hopper GPUs? (Measured: 1350+ FP8 TFLOPS)

2026/4/21

AI Summary (BLUF)

DeepGEMM is a high-performance matrix multiplication library optimized for NVIDIA Hopper GPUs, achieving over 1350 FP8 TFLOPS. It supports standard and Mixture-of-Experts (MoE) computations with just about 300 lines of core code.

Introduction

Here it comes again: our "Source God" DeepSeek has delivered once more.

On the third day of the DeepSeek Open Source Week, the team introduced DeepGEMM, a matrix multiplication library specifically optimized for Hopper architecture GPUs. This library supports both standard matrix computations and Mixture-of-Experts (MoE) model calculations, providing robust support for the training and inference of DeepSeek-V3/R1, achieving high performance of 1350+ FP8 TFLOPS on Hopper GPUs.

The design philosophy of DeepGEMM is simplicity and efficiency, with its core code comprising only about 300 lines, while outperforming existing solutions for most matrix sizes. The library supports three data layout modes: a standard layout and two special layouts designed for Mixture-of-Experts models. DeepGEMM employs Just-In-Time (JIT) compilation technology, eliminating the need for compilation during installation. Its code structure is clear and easy to understand, making it an excellent resource for learning GPU optimization techniques.

Core Concepts Explained

What Are FP8 and GEMM?

In computing, numerical values are stored using binary bits. The storage format determines precision and required space. Traditionally, AI computations use 32-bit floating-point numbers, which offer high precision but consume significant storage space and computational resources.

Research indicates that many AI tasks do not actually require such high precision. 16-bit floating-point numbers have been widely adopted, and 8-bit floating-point represents a further reduction in precision. Although FP8 offers lower precision, it is sufficient for many AI tasks while significantly reducing memory usage and increasing computational speed. This is akin to measuring a large object with a coarser scale: while precision is reduced, the speed is much faster, and it remains sufficiently accurate for most scenarios.

GEMM (General Matrix Multiply) is the most fundamental and common computational operation in deep learning. Simply put, it calculates the product of two data matrices. While this seems straightforward, in AI computations, these matrices can be enormous, containing millions of elements, making matrix multiplication one of the most time-consuming parts of the entire system. Essentially, the computation of almost all neural network layers involves matrix multiplication operations.
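To make the FP8 trade-off concrete, here is a small NumPy sketch (an illustration only, not DeepGEMM's kernel) that quantizes both operands to an E4M3-like value grid with a per-tensor scale, multiplies, and rescales the output. The `quantize_fp8_e4m3` helper is a crude stand-in for real FP8 and is not bit-exact:

```python
import numpy as np

def quantize_fp8_e4m3(x):
    """Crude FP8 E4M3 stand-in: scale into the format's range, then keep
    roughly 3 mantissa bits by rounding at an exponent-dependent step."""
    amax = float(np.abs(x).max())
    scale = 448.0 / amax                      # 448 = max normal E4M3 value
    xq = x * scale
    exp = np.floor(np.log2(np.maximum(np.abs(xq), 1e-12)))
    step = 2.0 ** (exp - 3)                   # 3 explicit mantissa bits
    xq = np.round(xq / step) * step
    return xq.astype(np.float32), np.float32(1.0 / scale)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)

aq, sa = quantize_fp8_e4m3(a)
bq, sb = quantize_fp8_e4m3(b)
c_fp8 = (aq @ bq) * (sa * sb)   # GEMM on quantized operands, then rescale
c_ref = a @ b                   # full-precision reference
rel_err = float(np.abs(c_fp8 - c_ref).max() / np.abs(c_ref).max())
```

Even though each stored value keeps only about four significant bits, the rescaled product stays within a small relative error of the float32 result, which is why FP8 is acceptable for many AI workloads.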

DeepGEMM specifically optimizes FP8-precision matrix multiplication while addressing potential precision issues that may arise when the Hopper architecture handles FP8 computations, ensuring accurate and reliable calculation results.
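DeepSeek's materials describe this fix as promoting partial sums into a higher-precision accumulator at regular intervals (two-level accumulation). The NumPy sketch below illustrates the numerics, using float16 as a stand-in for the low-precision accumulator; the block length and formats here are illustrative assumptions, not DeepGEMM's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16384
a16 = rng.uniform(0.0, 1.0, K).astype(np.float16)
b16 = rng.uniform(0.0, 1.0, K).astype(np.float16)
exact = float(np.dot(a16.astype(np.float64), b16.astype(np.float64)))

# Naive: one low-precision running sum for the whole dot product. Once the
# sum grows large, float16's coarse spacing swallows the small increments.
acc_low = np.float16(0.0)
for i in range(K):
    acc_low = np.float16(acc_low + a16[i] * b16[i])

# Two-level: accumulate short blocks in low precision, then "promote" each
# partial sum into a float32 accumulator.
BLOCK = 128
acc_promoted = np.float32(0.0)
for start in range(0, K, BLOCK):
    part = np.float16(0.0)
    for i in range(start, start + BLOCK):
        part = np.float16(part + a16[i] * b16[i])
    acc_promoted = acc_promoted + np.float32(part)

err_naive = abs(float(acc_low) - exact)
err_promoted = abs(float(acc_promoted) - exact)
```

In the naive loop the running sum eventually stalls: once it passes ~2048, float16's spacing exceeds every per-element product, so later terms are lost entirely. Periodic promotion to a wider accumulator sidesteps this.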

Standard Matrix Multiplication vs. Mixture-of-Experts Computation

Standard matrix multiplication deals with operations between complete matrices and is suitable for traditional neural network architectures where all data is processed uniformly.

In contrast, the Mixture-of-Experts (MoE) model is a special neural network architecture that consists of multiple "expert" networks and a "gating" network. The gating network is responsible for deciding which experts to assign the input data to for processing, rather than having all data pass through all experts. This approach allows the model size to grow significantly while maintaining computational efficiency, as only a portion of the model is activated for each processing step, not the entire model.

For MoE models, DeepGEMM provides two special data layout modes:

  • Contiguous layout: suitable for training and batch inference; concatenates the data handled by different experts into a single contiguous block;

  • Masked layout: suitable for real-time inference; uses masks to indicate which data needs processing, pairing especially well with CUDA Graph technology.
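The two layouts can be mimicked in a few lines of NumPy; this is a toy model of the data movement, not the library's actual API, and `assign` stands in for the gating network's routing decision:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, n_experts = 16, 8, 4, 3
x = rng.standard_normal((n_tokens, d_in)).astype(np.float32)
w = rng.standard_normal((n_experts, d_in, d_out)).astype(np.float32)
assign = rng.integers(0, n_experts, size=n_tokens)  # gating decision (toy)

# Contiguous layout: sort tokens by expert so each expert's inputs form one
# contiguous block, run one GEMM per block, then scatter back.
order = np.argsort(assign, kind="stable")
x_sorted = x[order]
out_sorted = np.empty((n_tokens, d_out), dtype=np.float32)
start = 0
for e in range(n_experts):
    cnt = int((assign == e).sum())
    out_sorted[start:start + cnt] = x_sorted[start:start + cnt] @ w[e]
    start += cnt
out_contig = np.empty_like(out_sorted)
out_contig[order] = out_sorted   # restore original token order

# Masked layout: tokens stay in place; each expert's GEMM only touches the
# rows its boolean mask marks (fixed shapes, which suits CUDA Graphs).
out_masked = np.empty((n_tokens, d_out), dtype=np.float32)
for e in range(n_experts):
    mask = assign == e
    out_masked[mask] = x[mask] @ w[e]

# Per-token reference: route each token through its own expert.
out_ref = np.stack([x[i] @ w[assign[i]] for i in range(n_tokens)])
```

Both layouts compute the same result; they differ only in how the per-expert work is packed for the GPU.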

Performance

DeepGEMM demonstrates excellent performance across various computational scenarios. For standard matrix multiplication, compared to optimized implementations based on CUTLASS 3.6, speed improvements range from 1.0x to 2.7x. The most significant acceleration is achieved in small-batch data processing, with a maximum speedup of 2.7x.

For Mixture-of-Experts model computations, the two special data layout modes provided by DeepGEMM also show clear advantages. The contiguous layout, suitable for training and batch inference stages, offers speed improvements of approximately 1.1x to 1.2x. The masked layout, designed specifically for real-time inference and supporting integration with CUDA Graph technology, similarly achieves speedups of 1.1x to 1.2x.

Considering that the official data may not be easily readable, we have reorganized the core performance comparison data as follows:

Standard Matrix Multiplication Performance Comparison (FP8, Hopper H100)

The table below shows the performance comparison (TFLOPS) between DeepGEMM and CUTLASS 3.6 on standard FP8 matrix multiplication. Higher values are better, with bold indicating DeepGEMM leads.

| Matrix Size (M×N×K) | Batch Size | DeepGEMM (TFLOPS) | CUTLASS 3.6 (TFLOPS) | Speedup |
|---------------------|------------|-------------------|----------------------|----------|
| 256×7168×2816       | 1          | ~1350             | ~500                 | ~2.7x    |
| 256×7168×2816       | 16         | ~1350             | ~1250                | ~1.08x   |
| 4096×4096×4096      | 1          | ~1300             | ~1280                | ~1.02x   |
| Small sizes (avg.)  | 1-16       | clearly ahead     | -                    | up to 2.7x |

Mixture-of-Experts Computation Performance Comparison

The table below compares the performance of the two special layouts provided by DeepGEMM for MoE models against baseline implementations.

| Computation | Data Layout | Use Case                   | DeepGEMM Speedup | Key Technique                     |
|-------------|-------------|----------------------------|------------------|-----------------------------------|
| MoE         | Contiguous  | Training / batch inference | 1.1x - 1.2x      | Contiguous data blocks            |
| MoE         | Masked      | Real-time inference        | 1.1x - 1.2x      | Mask markers, CUDA Graph friendly |

Architecture and Technical Depth

Hopper GPUs and Tensor Cores

NVIDIA's Hopper GPU is the latest hardware platform specifically designed for artificial intelligence and high-performance computing, offering several key technological improvements:

Tensor Cores are specialized computational units within the GPU, optimized specifically for matrix operations, capable of significantly accelerating deep learning computations. The Tensor Cores in the Hopper architecture support FP8 computation, providing higher performance than previous generations.

TMA (Tensor Memory Accelerator) is a new feature introduced in the Hopper architecture for faster, asynchronous data movement. DeepGEMM fully leverages TMA technology for loading and storing data and utilizes advanced features like TMA multicast and descriptor prefetching to further enhance performance.

Just-In-Time Compilation

Just-In-Time (JIT) compilation is a technique where programs are compiled at runtime, rather than being pre-compiled during installation or deployment. DeepGEMM adopts a fully JIT-compiled design, where all computational kernels are compiled during actual runtime. This offers several advantages:

  • Matrix shapes, block sizes, etc., can be treated as compile-time constants, saving computational resources and enabling more compiler optimizations;

  • The library automatically selects the optimal parameter configuration for each task, without manual tuning;

  • Computation pipelines are fully unrolled, giving the compiler more room to optimize, which particularly benefits small matrices.

This JIT compilation approach significantly improves computational performance for small matrix shapes, following a technical philosophy similar to modern compilers like Triton.
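DeepGEMM's kernels are CUDA C++ compiled at runtime, but the shape-specialization idea is language-agnostic. The hypothetical pure-Python sketch below bakes the matrix shape into generated source as literals and caches one compiled function per shape, mirroring how a JIT can treat shapes as compile-time constants:

```python
# Hypothetical illustration: generate and cache one specialized kernel per
# (m, n, k) shape, with the shape embedded as compile-time literals.
_kernel_cache = {}

def get_matmul_kernel(m, n, k):
    key = (m, n, k)
    if key not in _kernel_cache:
        # Shapes become literals in the generated source, so loop bounds
        # are constants; DeepGEMM does the same with NVCC and CUDA C++.
        src = f"""
def kernel(a, b):
    out = [[0.0] * {n} for _ in range({m})]
    for i in range({m}):
        for j in range({n}):
            acc = 0.0
            for p in range({k}):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out
"""
        ns = {}
        exec(compile(src, f"<kernel_{m}x{n}x{k}>", "exec"), ns)
        _kernel_cache[key] = ns["kernel"]
    return _kernel_cache[key]

kernel = get_matmul_kernel(2, 3, 2)   # compiled on first request, then cached
result = kernel([[1, 2], [3, 4]], [[5, 6, 7], [8, 9, 10]])
```

The first call pays the compilation cost; every later call with the same shape reuses the cached, shape-specialized function.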

CUDA and CUTLASS

CUDA is a parallel computing platform and programming model developed by NVIDIA, allowing developers to leverage the powerful parallel processing capabilities of GPUs. It is the foundational tool for writing GPU programs.

CUTLASS is NVIDIA's open-source matrix multiplication library, providing high-performance matrix computation templates. DeepGEMM draws inspiration from some concepts in CUTLASS but does not directly depend on its complex template system. Instead, it implements its own, more concise codebase, ensuring performance while remaining easy to understand and learn from.

Thread Specialization

DeepGEMM employs thread specialization technology, an efficient method of task division. In this design, different computational threads are assigned specific responsibilities: some handle data movement, some handle core computations, and some handle result processing.

This division of labor allows data movement, computation, and post-processing to occur simultaneously, forming an efficient pipeline that significantly improves overall performance.
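As a rough analogy (Python threads rather than GPU warps, and all names here are illustrative), the sketch below splits work between a "loader" thread that stages operand blocks and a "compute" thread that multiplies them, overlapping the two stages through a bounded queue:

```python
import queue
import threading

import numpy as np

def run_pipeline(blocks_a, blocks_b):
    """Loader thread stages data (stand-in for TMA loads); compute thread
    does the math (stand-in for tensor-core warps). The bounded queue lets
    the two stages overlap, producer-consumer style."""
    staged = queue.Queue(maxsize=2)
    results = []

    def loader():
        for a, b in zip(blocks_a, blocks_b):
            staged.put((a, b))      # "copy into shared memory"
        staged.put(None)            # sentinel: no more work

    def compute():
        while (item := staged.get()) is not None:
            a, b = item
            results.append(a @ b)   # the "math" stage

    threads = [threading.Thread(target=loader), threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

outs = run_pipeline(
    [np.eye(2, dtype=np.float32), 2 * np.eye(2, dtype=np.float32)],
    [np.ones((2, 2), dtype=np.float32)] * 2,
)
```

On a GPU the payoff is that data movement for block N+1 proceeds while block N is being multiplied, keeping both the memory units and the tensor cores busy.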

Key Technical Innovations

DeepGEMM incorporates several advanced technological innovations:

Non-Standard Block Sizes

Traditionally, GPU computations often use standard-sized data blocks. DeepGEMM supports non-standard block sizes, which can better adapt to specific matrix shapes and improve hardware resource utilization. For example, for a matrix of M=256, N=7168, standard block sizes can only utilize 112 computational units, whereas using non-standard block sizes can utilize 128 units, resulting in a noticeable efficiency improvement.
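The arithmetic behind that example is easy to check. Assuming (for illustration) a power-of-two 128×128 output tile versus an unaligned 128×112 one, counting launched tiles for the article's M=256, N=7168 case reproduces the 112 vs. 128 figures:

```python
import math

def blocks_used(m, n, block_m, block_n):
    """Number of output tiles a GEMM launches for an (m x n) result."""
    return math.ceil(m / block_m) * math.ceil(n / block_n)

# M=256, N=7168 (the article's example); the tile sizes are assumptions.
standard  = blocks_used(256, 7168, 128, 128)  # 2 * 56 = 112 tiles
unaligned = blocks_used(256, 7168, 128, 112)  # 2 * 64 = 128 tiles
```

An H100 has on the order of 132 SMs, so 128 concurrent tiles keep far more of the chip busy in a single wave than 112 do, which is the utilization gain the text describes.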

Instruction-Level Optimization

By analyzing machine code generated by different compiler versions, the DeepGEMM team discovered and implemented special instruction scheduling optimizations. This low-level optimization adjusts the execution order of computational instructions, enabling computational units to work in parallel more efficiently and significantly boosting FP8 computation performance.

Unified Scheduling System

DeepGEMM has designed a unified computational task scheduling system that employs special arrangement strategies to enhance cache reuse efficiency, reduce memory access, and improve overall performance.
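One common scheduling trick of this kind (shown here as a hypothetical sketch, not DeepGEMM's exact policy) is to rasterize output tiles in column groups rather than plain row-major order, so that consecutively scheduled tiles share operand slices and hit in cache:

```python
def raster_order(num_m_blocks, num_n_blocks, group_n=4):
    """Visit output tiles in column groups of width `group_n`: within each
    group, sweep all M-blocks so the same few slices of B stay hot in
    cache, instead of streaming all of B for every row of tiles."""
    order = []
    for n0 in range(0, num_n_blocks, group_n):
        for m in range(num_m_blocks):
            for n in range(n0, min(n0 + group_n, num_n_blocks)):
                order.append((m, n))
    return order

schedule = raster_order(2, 8)  # a 2x8 grid of output tiles
```

The visit order changes nothing about the math; it only changes which operand data is re-read from cache versus from memory.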

Usage Guide

System Requirements

Using DeepGEMM requires meeting the following software and hardware requirements:

| Component | Minimum                      | Recommended | Notes                          |
|-----------|------------------------------|-------------|--------------------------------|
| GPU       | Hopper architecture (sm_90a) | NVIDIA H100 | Must support FP8 Tensor Cores  |
| Python    | 3.8                          | 3.10+       | -                              |
| CUDA      | 12.3                         | 12.4+       | -                              |
| PyTorch   | 2.1                          | 2.3+        | -                              |
| CUTLASS   | 3.6                          | 3.6+        | Used as reference implementation |

Installation and Development

Development

# Submodule must be cloned
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git
# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop
# Test JIT compilation
python tests/test_jit.py
# Test all GEMM implementations (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py

Installation

python setup.py install

Finally, simply import deep_gemm.

API Overview

DeepGEMM provides a clear Python programming interface, including:

  • Standard matrix multiplication functions: for conventional neural network computations;

  • Contiguous layout grouped functions: for Mixture-of-Experts model training and batch inference;

  • Masked layout grouped functions: for MoE real-time inference, designed to pair with CUDA Graphs.

FAQ

What are DeepGEMM's core advantages over other matrix computation libraries?

DeepGEMM is optimized specifically for NVIDIA Hopper GPUs, with a core of only about 300 lines of code. Through JIT compilation and thread specialization it outperforms existing solutions for most matrix sizes, and it supports both standard and Mixture-of-Experts computations.

How does DeepGEMM handle precision issues in FP8 computation?

DeepGEMM optimizes FP8-precision matrix multiplication while addressing the precision issues that can arise when the Hopper architecture performs FP8 computation, keeping results accurate and reliable as speed improves.

What special support does DeepGEMM provide for Mixture-of-Experts computation?

DeepGEMM supports two special data layouts designed for MoE models, optimizing the computation flow in which the gating network routes inputs to a subset of experts, improving the efficiency of MoE training and inference.
