FlashMLA: DeepSeek's High-Performance Attention Kernel Library, Powering V3 Models at up to 660 TFLOPS

2026/1/23
AI Summary (BLUF)

FlashMLA is DeepSeek's optimized attention kernel library that powers the DeepSeek-V3 models, featuring token-level sparse attention with FP8 KV cache support and achieving up to 660 TFLOPS on NVIDIA H800 GPUs.

Introduction

FlashMLA is DeepSeek's library of highly optimized attention kernels, serving as the computational engine for the DeepSeek-V3 and DeepSeek-V3.2-Exp models. This repository contains implementations for both sparse and dense attention mechanisms, designed to maximize performance on modern GPU architectures.

Core Components

Sparse Attention Kernels

These kernels implement DeepSeek Sparse Attention (DSA), a token-level sparse attention mechanism that significantly reduces computational overhead while maintaining model quality.

  • Token-level sparse attention for the prefill stage
  • Token-level sparse attention for the decoding stage with FP8 KV cache

Dense Attention Kernels

Traditional dense attention implementations optimized for maximum throughput on supported hardware.

  • Dense attention for the prefill stage
  • Dense attention for the decoding stage

Recent Updates

2025.09.29: Sparse Attention Kernels Release

With the launch of DeepSeek-V3.2, we're releasing the corresponding token-level sparse attention kernels. These kernels power the model's DeepSeek Sparse Attention (DSA) and achieve remarkable performance: up to 640 TFLOPS during prefilling and 410 TFLOPS during decoding. We've also published a detailed technical blog about our new FP8 sparse decoding kernel.

2025.08.01: MHA Kernels for SM100 Architecture

Thanks to NVIDIA's contribution, we now support Multi-Head Attention forward and backward kernels on SM100 architecture.

2025.04.22: Performance Improvements

We're excited to announce a new release of FlashMLA that delivers a 5% to 15% performance improvement for compute-bound workloads. The library now achieves up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs. The new version maintains full interface compatibility with previous releases.

Performance Benchmarks

Decoding Performance

Dense MLA Decoding:

  • Memory-bound configuration: Up to 3000 GB/s
  • Compute-bound configuration: 660 TFLOPS on H800 SXM5 with CUDA 12.8

Token-level Sparse MLA Decoding (FP8 KV cache):

  • Compute-bound configuration: 410 TFLOPS on H800 SXM5 with CUDA 12.8
  • B200 performance: Up to 350 TFLOPS (not yet fully optimized)

Prefill Performance

Dense MHA Prefill (B200):

  • Forward computation: Up to 1460 TFLOPS
  • Backward computation: Up to 1000 TFLOPS

Sparse MLA Prefill:

  • H800 SXM5 with CUDA 12.8: Up to 640 TFLOPS
  • B200 with CUDA 12.9: Up to 1450 TFLOPS

System Requirements

  • GPU Architecture: SM90 / SM100 (see support matrix below)
  • CUDA: 12.8 and above (CUDA 12.9+ required for SM100 kernels)
  • PyTorch: 2.0 and above

Support Matrix

Kernel            GPU Architecture   MLA Mode [2]   KV Cache Format
Dense Decoding    SM90               MQA            BF16
Sparse Decoding   SM90 & SM100       MQA            FP8 [1]
Dense Prefill     SM100              MHA            -
Sparse Prefill    SM90 & SM100       MQA            -

[1]: For more details on using the FP8 KV cache, see the FP8 KV Cache Format section below.
[2]: "MLA Mode" refers to the mode used for MLA calculation. MQA stands for Multi-Query Attention mode (head_dim_k = 576 with head_dim_v = 512), while MHA stands for Multi-Head Attention mode (head_dim_k = 192/128 with head_dim_v = 128).

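To make footnote [2] concrete, here is a small shape sketch of the two modes. This is only an illustration: the head counts and sequence lengths are assumed values, and paging/batching of the KV cache is ignored.

import torch

# MQA mode (MLA decoding and sparse kernels): one shared KV head,
# head_dim_k = 576 (512 "NoPE" + 64 RoPE values), head_dim_v = 512.
q_mqa  = torch.empty(1, 128, 576)     # [s_q, h_q, head_dim_k]
kv_mqa = torch.empty(8192, 1, 576)    # [s_kv, h_kv, head_dim_k]

# MHA mode (dense prefill): per-head K and V, head_dim_k = 192 (or 128),
# head_dim_v = 128.
q_mha = torch.empty(4096, 128, 192)   # [s_q, h_q, head_dim_k]
k_mha = torch.empty(4096, 128, 192)   # [s_kv, h_q, head_dim_k]
v_mha = torch.empty(4096, 128, 128)   # [s_kv, h_q, head_dim_v]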

Installation

git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
cd flash-mla
git submodule update --init --recursive
pip install -v .
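
A minimal post-install sanity check (our suggestion, not an official install step; it assumes the install above succeeded and a supported GPU is visible):

import torch
import flash_mla  # package name taken from the usage example below

# SM90 (Hopper) reports (9, 0) and SM100 (Blackwell) reports (10, 0).
print(torch.cuda.get_device_capability(0))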

Usage Examples

MLA Decoding

To use the MLA decoding kernels, call get_mla_metadata once before the decoding loop to obtain tile scheduler metadata, then call flash_mla_with_kvcache in each decoding step.

from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Obtain tile-scheduler metadata once, before entering the decoding loop.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens,
    s_q * h_q // h_kv,
    h_kv,
    h_q,
    is_fp8,
    topk,
)

for i in range(num_layers):
    ...
    # Called for every attention layer in each decoding step.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits,
        is_causal, is_fp8_kvcache, indices,
    )
    ...

Key Parameters:

  • s_q: Number of query tokens per sequence (should be 1 if MTP / speculative decoding is disabled)
  • h_kv: Number of key-value heads
  • h_q: Number of query heads (an illustrative setup of these values follows the list)
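
For orientation, a hypothetical setup of these scalars for the call above. Every value is illustrative: the head count and the DSA topk are assumptions of this post, and the real values come from the model configuration.

import torch

# Illustrative values only -- not prescribed by FlashMLA.
s_q  = 1      # one query token per sequence (MTP / speculative decoding disabled)
h_q  = 128    # number of query heads (assumed head count)
h_kv = 1      # a single shared KV head in MQA mode
dv   = 512    # value head dimension (see the support matrix footnote)
topk = 2048   # tokens selected per query by DSA -- assumed value
is_causal = True
is_fp8 = is_fp8_kvcache = True    # read the KV cache in the FP8-with-scale format
cache_seqlens = torch.tensor([4096, 8192], dtype=torch.int32, device="cuda")  # one entry per sequence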

FP8 KV Cache Format

When is_fp8_kvcache is set to True, the kernel reads the KV cache in the "FP8 with scale" format. Each token's KV cache occupies 656 bytes with the following structure:

  1. First 512 bytes: "Quantized NoPE" part containing 512 float8_e4m3 values
  2. Next 16 bytes: Scale factors containing 4 float32 values
  3. Last 128 bytes: "RoPE" part containing 64 bfloat16 values (not quantized); a dequantization sketch follows the list
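
A minimal dequantization sketch of this per-token layout. One assumption is made here: each of the 4 scale factors is taken to cover a contiguous group of 128 NoPE values; the actual grouping is defined by the kernels, not by this post. Requires a PyTorch build with torch.float8_e4m3fn.

import torch

def unpack_fp8_kv_token(raw: torch.Tensor) -> torch.Tensor:
    """raw: uint8 tensor of 656 bytes holding one token's KV cache entry."""
    assert raw.dtype == torch.uint8 and raw.numel() == 656
    nope_q = raw[:512].view(torch.float8_e4m3fn)   # 512 quantized NoPE values
    scales = raw[512:528].view(torch.float32)      # 4 scale factors
    rope   = raw[528:].view(torch.bfloat16)        # 64 unquantized RoPE values
    # Assumed grouping: scale i applies to NoPE values [128 * i, 128 * (i + 1)).
    nope = nope_q.float().reshape(4, 128) * scales[:, None]
    return torch.cat([nope.reshape(512), rope.float()])   # 576 dequantized values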

Sparse Attention with Indices Tensor

The indices tensor enables token-level sparse attention by specifying which tokens to compute attention for.

  • Shape: 3D tensor of shape (batch_size, seq_len_q, topk)
  • Format: indices_in_kvcache[i][j][k] = (page_block_index) * page_block_size + (token_offset)
  • Invalid entries: Set to -1 (a construction sketch follows the list)
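
A small sketch of building such a tensor from (page_block_index, token_offset) pairs. The batch size, query length, topk, page block size, and the int32 dtype are illustrative assumptions, not values required by FlashMLA.

import torch

batch_size, seq_len_q, topk, page_block_size = 1, 1, 8, 64   # illustrative sizes

def make_row(selected):
    """selected: list of (page_block_index, token_offset) pairs for one query token."""
    vals = [blk * page_block_size + off for blk, off in selected]
    vals += [-1] * (topk - len(vals))              # pad unused slots with -1
    return torch.tensor(vals, dtype=torch.int32)   # dtype is an assumption

# One query token attending to three selected tokens spread over two pages.
indices = make_row([(0, 3), (0, 17), (2, 5)]).view(batch_size, seq_len_q, topk)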

Technical Implementation Details

Sparse MLA Prefill

The sparse MLA prefill kernel is called using flash_mla_sparse_fwd with the following parameters:

# Parameters:
#   q:        Query tensor of shape [s_q, h_q, d_qk]
#   kv:       Key-Value tensor of shape [s_kv, h_kv, d_qk]
#   indices:  Indices tensor of shape [s_q, h_kv, topk]
#   sm_scale: Scalar softmax scale
#
# Note: this kernel does not natively support a batch dimension.
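
A hedged end-to-end sketch of a call with these shapes. The import path (mirroring the decoding example), the tensor dtypes, and the softmax scale are assumptions of this post, and a CUDA device with a supported architecture is needed to actually run it.

import torch
from flash_mla import flash_mla_sparse_fwd   # assumed import path

s_q, s_kv, h_q, h_kv, d_qk, topk = 1024, 8192, 128, 1, 576, 256   # illustrative sizes
q = torch.randn(s_q, h_q, d_qk, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(s_kv, h_kv, d_qk, dtype=torch.bfloat16, device="cuda")
indices = torch.randint(0, s_kv, (s_q, h_kv, topk), dtype=torch.int32, device="cuda")  # dtype assumed
sm_scale = d_qk ** -0.5   # placeholder scale; use the model's actual value

out, max_logits, lse = flash_mla_sparse_fwd(q, kv, indices, sm_scale)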

Mathematical Equivalent

The kernel returns (out, max_logits, lse); its computation is mathematically equivalent to the following PyTorch operations:

# Equivalent PyTorch reference (h_kv must be 1)
import math
import torch

kv = kv.squeeze(1)              # [s_kv, d_qk]
indices = indices.squeeze(1)    # [s_q, topk]
focused_kv = kv[indices]        # [s_q, topk, d_qk]

# Logits are pre-scaled by log2(e), so max_logits and lse are expressed in base 2.
P = (q @ focused_kv.transpose(-1, -2)) * sm_scale * math.log2(math.e)  # [s_q, h_q, topk]
max_logits = P.max(dim=-1).values                                      # [s_q, h_q]
lse = torch.logsumexp(P * math.log(2), dim=-1) * math.log2(math.e)     # log2(sum(2 ** P))
S = torch.exp2(P - lse.unsqueeze(-1))                                  # softmax over the selected tokens
out = S @ focused_kv                                                   # [s_q, h_q, d_qk]

Community Support

FlashMLA has been adapted to a variety of hardware platforms by community contributors.

Citation

If you use FlashMLA in your research, please cite:

@misc{flashmla2025,
      title={FlashMLA: Efficient Multi-head Latent Attention Kernels},
      author={Jiashi Li and Shengyu Liu},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}

Acknowledgments

FlashMLA is inspired by and builds upon the excellent work of FlashAttention 2&3 and the CUTLASS project. We extend our gratitude to NVIDIA for their contributions to the SM100 MHA kernels and to all community members who have adapted FlashMLA to various hardware platforms.

Note: This blog post covers the essential aspects of FlashMLA. For complete documentation, performance tuning guides, and advanced usage examples, please refer to the official GitHub repository and accompanying technical blogs.
