FlashMLA: DeepSeek's High-Performance Attention Kernel Library, Powering V3 Models at up to 660 TFLOPS

2026/1/23
AI Summary (BLUF)

FlashMLA is DeepSeek's optimized attention kernel library that powers the DeepSeek-V3 models, featuring token-level sparse attention with FP8 KV cache support and achieving up to 660 TFLOPS on NVIDIA H800 GPUs.

Introduction

FlashMLA is DeepSeek's library of highly optimized attention kernels, serving as the computational engine for the DeepSeek-V3 and DeepSeek-V3.2-Exp models. This repository contains implementations for both sparse and dense attention mechanisms, designed to maximize performance on modern GPU architectures.

Core Components

Sparse Attention Kernels

These kernels implement DeepSeek Sparse Attention (DSA), a token-level sparse attention mechanism that significantly reduces computational overhead while maintaining model quality.

  • Token-level sparse attention for the prefill stage
  • Token-level sparse attention for the decoding stage with FP8 KV cache

Dense Attention Kernels

Traditional dense attention implementations optimized for maximum throughput on supported hardware.

  • Dense attention for the prefill stage
  • Dense attention for the decoding stage

Recent Updates

2025.09.29: Sparse Attention Kernels Release

With the launch of DeepSeek-V3.2, we're releasing the corresponding token-level sparse attention kernels. These kernels power the model's DeepSeek Sparse Attention (DSA) and achieve remarkable performance: up to 640 TFLOPS during prefilling and 410 TFLOPS during decoding. We've also published a detailed technical blog about our new FP8 sparse decoding kernel.

2025.08.01: MHA Kernels for SM100 Architecture

Thanks to NVIDIA's contribution, we now support Multi-Head Attention forward and backward kernels on SM100 architecture.

2025.04.22: Performance Improvements

We're excited to announce a new release of FlashMLA that delivers a 5% to 15% performance improvement for compute-bound workloads. The library now achieves up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs. The new version maintains full interface compatibility with previous releases.

Performance Benchmarks

Decoding Performance

Dense MLA Decoding:

  • Memory-bound configuration: Up to 3000 GB/s
  • Compute-bound configuration: 660 TFLOPS on H800 SXM5 with CUDA 12.8

Token-level Sparse MLA Decoding (FP8 KV cache):

  • Compute-bound configuration: 410 TFLOPS on H800 SXM5 with CUDA 12.8
  • B200 performance: Up to 350 TFLOPS (not yet fully optimized)

Prefill Performance

Dense MHA Prefill (B200):

  • Forward computation: Up to 1460 TFLOPS
  • Backward computation: Up to 1000 TFLOPS

Sparse MLA Prefill:

  • H800 SXM5 with CUDA 12.8: Up to 640 TFLOPS
  • B200 with CUDA 12.9: Up to 1450 TFLOPS

System Requirements

  • GPU Architecture: SM90 / SM100 (see support matrix below)
  • CUDA: 12.8 and above (CUDA 12.9+ required for SM100 kernels)
  • PyTorch: 2.0 and above

Support Matrix

Kernel            GPU Architecture   MLA Mode [2]   KV Cache Format
Dense Decoding    SM90               MQA            BF16
Sparse Decoding   SM90 & SM100       MQA            FP8 [1]
Dense Prefill     SM100              MHA            -
Sparse Prefill    SM90 & SM100       MQA            -

[1]: For more details on using the FP8 KV cache, see the FP8 KV Cache Format section below.
[2]: "MLA Mode" refers to the mode used for MLA calculation. MQA stands for Multi-Query Attention mode (head_dim_k = 576 with head_dim_v = 512), while MHA stands for Multi-Head Attention mode (head_dim_k = 192/128 with head_dim_v = 128).

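To make footnote [2] concrete, here is a small shape sketch of the two modes. This is only an illustration: the head counts and sequence lengths are assumed values, and paging/batching of the KV cache is ignored.

import torch

# MQA mode (MLA decoding and sparse kernels): one shared KV head,
# head_dim_k = 576 (512 "NoPE" + 64 RoPE values), head_dim_v = 512.
q_mqa  = torch.empty(1, 128, 576)     # [s_q, h_q, head_dim_k]
kv_mqa = torch.empty(8192, 1, 576)    # [s_kv, h_kv, head_dim_k]

# MHA mode (dense prefill): per-head K and V, head_dim_k = 192 (or 128),
# head_dim_v = 128.
q_mha = torch.empty(4096, 128, 192)   # [s_q, h_q, head_dim_k]
k_mha = torch.empty(4096, 128, 192)   # [s_kv, h_q, head_dim_k]
v_mha = torch.empty(4096, 128, 128)   # [s_kv, h_q, head_dim_v]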

Installation

git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
cd flash-mla
git submodule update --init --recursive
pip install -v .
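
A minimal post-install sanity check (our suggestion, not an official install step; it assumes the install above succeeded and a supported GPU is visible):

import torch
import flash_mla  # package name taken from the usage example below

# SM90 (Hopper) reports (9, 0) and SM100 (Blackwell) reports (10, 0).
print(torch.cuda.get_device_capability(0))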

Usage Examples

MLA Decoding

To use the MLA decoding kernels, call get_mla_metadata once before the decoding loop to obtain tile scheduler metadata, then call flash_mla_with_kvcache in each decoding step.

from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Obtain tile-scheduler metadata once, before entering the decoding loop.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens,
    s_q * h_q // h_kv,
    h_kv,
    h_q,
    is_fp8,
    topk,
)

for i in range(num_layers):
    ...
    # Called for every attention layer in each decoding step.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits,
        is_causal, is_fp8_kvcache, indices,
    )
    ...

Key Parameters:

  • s_q: Number of query tokens per sequence (should be 1 if MTP / speculative decoding is disabled)
  • h_kv: Number of key-value heads
  • h_q: Number of query heads (an illustrative setup of these values follows the list)
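
For orientation, a hypothetical setup of these scalars for the call above. Every value is illustrative: the head count and the DSA topk are assumptions of this post, and the real values come from the model configuration.

import torch

# Illustrative values only -- not prescribed by FlashMLA.
s_q  = 1      # one query token per sequence (MTP / speculative decoding disabled)
h_q  = 128    # number of query heads (assumed head count)
h_kv = 1      # a single shared KV head in MQA mode
dv   = 512    # value head dimension (see the support matrix footnote)
topk = 2048   # tokens selected per query by DSA -- assumed value
is_causal = True
is_fp8 = is_fp8_kvcache = True    # read the KV cache in the FP8-with-scale format
cache_seqlens = torch.tensor([4096, 8192], dtype=torch.int32, device="cuda")  # one entry per sequence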

FP8 KV Cache Format

When is_fp8_kvcache is set to True, the kernel reads the KV cache in the "FP8 with scale" format. Each token's KV cache occupies 656 bytes with the following structure:

  1. First 512 bytes: "Quantized NoPE" part containing 512 float8_e4m3 values
  2. Next 16 bytes: Scale factors containing 4 float32 values
  3. Last 128 bytes: "RoPE" part containing 64 bfloat16 values (not quantized); a dequantization sketch follows the list
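
A minimal dequantization sketch of this per-token layout. One assumption is made here: each of the 4 scale factors is taken to cover a contiguous group of 128 NoPE values; the actual grouping is defined by the kernels, not by this post. Requires a PyTorch build with torch.float8_e4m3fn.

import torch

def unpack_fp8_kv_token(raw: torch.Tensor) -> torch.Tensor:
    """raw: uint8 tensor of 656 bytes holding one token's KV cache entry."""
    assert raw.dtype == torch.uint8 and raw.numel() == 656
    nope_q = raw[:512].view(torch.float8_e4m3fn)   # 512 quantized NoPE values
    scales = raw[512:528].view(torch.float32)      # 4 scale factors
    rope   = raw[528:].view(torch.bfloat16)        # 64 unquantized RoPE values
    # Assumed grouping: scale i applies to NoPE values [128 * i, 128 * (i + 1)).
    nope = nope_q.float().reshape(4, 128) * scales[:, None]
    return torch.cat([nope.reshape(512), rope.float()])   # 576 dequantized values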

Sparse Attention with Indices Tensor

The indices tensor enables token-level sparse attention by specifying which tokens to compute attention for.

  • Shape: 3D tensor of shape (batch_size, seq_len_q, topk)
  • Format: indices_in_kvcache[i][j][k] = (page_block_index) * page_block_size + (token_offset)
  • Invalid entries: Set to -1 (a construction sketch follows the list)
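
A small sketch of building such a tensor from (page_block_index, token_offset) pairs. The batch size, query length, topk, page block size, and the int32 dtype are illustrative assumptions, not values required by FlashMLA.

import torch

batch_size, seq_len_q, topk, page_block_size = 1, 1, 8, 64   # illustrative sizes

def make_row(selected):
    """selected: list of (page_block_index, token_offset) pairs for one query token."""
    vals = [blk * page_block_size + off for blk, off in selected]
    vals += [-1] * (topk - len(vals))              # pad unused slots with -1
    return torch.tensor(vals, dtype=torch.int32)   # dtype is an assumption

# One query token attending to three selected tokens spread over two pages.
indices = make_row([(0, 3), (0, 17), (2, 5)]).view(batch_size, seq_len_q, topk)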

Technical Implementation Details

Sparse MLA Prefill

The sparse MLA prefill kernel is called using flash_mla_sparse_fwd with the following parameters:

# Parameters:
#   q:        Query tensor of shape [s_q, h_q, d_qk]
#   kv:       Key-Value tensor of shape [s_kv, h_kv, d_qk]
#   indices:  Indices tensor of shape [s_q, h_kv, topk]
#   sm_scale: Scalar softmax scale
#
# Note: this kernel does not natively support a batch dimension.
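
A hedged end-to-end sketch of a call with these shapes. The import path (mirroring the decoding example), the tensor dtypes, and the softmax scale are assumptions of this post, and a CUDA device with a supported architecture is needed to actually run it.

import torch
from flash_mla import flash_mla_sparse_fwd   # assumed import path

s_q, s_kv, h_q, h_kv, d_qk, topk = 1024, 8192, 128, 1, 576, 256   # illustrative sizes
q = torch.randn(s_q, h_q, d_qk, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(s_kv, h_kv, d_qk, dtype=torch.bfloat16, device="cuda")
indices = torch.randint(0, s_kv, (s_q, h_kv, topk), dtype=torch.int32, device="cuda")  # dtype assumed
sm_scale = d_qk ** -0.5   # placeholder scale; use the model's actual value

out, max_logits, lse = flash_mla_sparse_fwd(q, kv, indices, sm_scale)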

Mathematical Equivalent

The kernel returns (out, max_logits, lse); its computation is mathematically equivalent to the following PyTorch operations:

# Equivalent PyTorch reference (h_kv must be 1)
import math
import torch

kv = kv.squeeze(1)              # [s_kv, d_qk]
indices = indices.squeeze(1)    # [s_q, topk]
focused_kv = kv[indices]        # [s_q, topk, d_qk]

# Logits are pre-scaled by log2(e), so max_logits and lse are expressed in base 2.
P = (q @ focused_kv.transpose(-1, -2)) * sm_scale * math.log2(math.e)  # [s_q, h_q, topk]
max_logits = P.max(dim=-1).values                                      # [s_q, h_q]
lse = torch.logsumexp(P * math.log(2), dim=-1) * math.log2(math.e)     # log2(sum(2 ** P))
S = torch.exp2(P - lse.unsqueeze(-1))                                  # softmax over the selected tokens
out = S @ focused_kv                                                   # [s_q, h_q, d_qk]

Community Support

FlashMLA has been adapted to a variety of hardware platforms by community contributors.

Citation

If you use FlashMLA in your research, please cite:

@misc{flashmla2025,
      title={FlashMLA: Efficient Multi-head Latent Attention Kernels},
      author={Jiashi Li and Shengyu Liu},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}

Acknowledgments

FlashMLA is inspired by and builds upon the excellent work of FlashAttention 2&3 and the CUTLASS project. We extend our gratitude to NVIDIA for their contributions to the SM100 MHA kernels and to all community members who have adapted FlashMLA to various hardware platforms.

Note: This blog post covers the essential aspects of FlashMLA. For complete documentation, performance tuning guides, and advanced usage examples, please refer to the official GitHub repository and accompanying technical blogs.
