
FlashMLA: DeepSeek's High-Performance Decoding Kernel for Hopper GPUs

2026/1/23

Introduction

Today marks the beginning of DeepSeek's Open Source Week. While the acronym "FlashMLA" might seem like a jumble of familiar letters, its significance is substantial. To demystify this technology, we've prepared a concise guide. FlashMLA is an efficient MLA (Multi-Head Latent Attention) decoding kernel optimized for Hopper GPUs; it supports variable-length sequences and is ready for production use. By optimizing MLA decoding and paged KV caching, FlashMLA significantly boosts the inference efficiency of large language models (LLMs), extracting peak performance from high-end GPUs such as the H100 and H800.

In simpler terms, FlashMLA is an advanced piece of technology designed specifically for Hopper high-performance AI chips: a Multi-Head Latent Attention decoding kernel. Think of it as a super-efficient "translator" that lets computers process linguistic information much faster, handling language sequences of varying lengths with remarkable speed. When interacting with a chatbot, for instance, it delivers quicker responses without lag. It achieves this efficiency primarily by optimizing complex computational processes, akin to upgrading the computer's "brain" to be smarter and more efficient at language tasks.

Key Concepts and Inspiration

DeepSeek officially notes that FlashMLA draws inspiration from FlashAttention 2 and 3 and from the CUTLASS project. FlashAttention is an efficient attention computation method optimized for the self-attention mechanism in Transformer models (such as GPT and BERT); its core goal is to reduce memory footprint and accelerate computation. CUTLASS is NVIDIA's open-source library of CUDA C++ templates for high-performance matrix operations, which likewise helps improve computational efficiency.

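For context, the bottleneck FlashAttention targets can be written down in one line. Standard attention computes (simplified, omitting masking):

    Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) · V

For a sequence of N tokens, the intermediate score matrix QK^T has N × N entries, and materializing it in GPU memory dominates the memory cost. FlashAttention computes the same result in tiles with an online softmax, so the full N × N matrix never has to be written out to HBM.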

DeepSeek's rapid rise to prominence is largely attributable to building high-performance models at relatively low cost. The secret behind this success lies in innovations in model architecture and training techniques, particularly the use of Mixture of Experts (MoE) and Multi-Head Latent Attention (MLA). FlashMLA is DeepSeek's own optimized implementation of the MLA technique.

What is MLA (Multi-head Latent Attention)?

In traditional language models, there is a technique called Multi-Head Attention (MHA). It allows the computer to better understand language, akin to how humans use their eyes to focus on multiple points simultaneously. However, this technique has a drawback: it must keep a large key-value (KV) cache in memory for every token it has seen, like a very spacious "warehouse," and an oversized warehouse wastes precious GPU memory.

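To make the "warehouse" concrete, here is a rough back-of-the-envelope calculation for a standard MHA KV cache. The model dimensions below are hypothetical, chosen only for illustration:

    per-token cache = 2 (K and V) × n_layers × n_heads × d_head × bytes per element
                    = 2 × 32 × 32 × 128 × 2 bytes (FP16) ≈ 0.5 MB per token

At roughly 0.5 MB per token, a single 4,096-token conversation already occupies about 2 GB of GPU memory for the cache alone, before counting model weights or activations.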

The advancement in MLA lies in a method called "low-rank decomposition." It compresses that large warehouse into a smaller one while maintaining the same functionality. It's like replacing a large refrigerator with a compact one that can still hold all the contents. This approach not only saves space when handling language tasks but also increases speed. Crucially, although MLA compresses the "warehouse," its performance remains as effective as the original, without compromise.

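In slightly more formal terms (this follows the MLA formulation in the DeepSeek-V2 paper, with notation simplified here), keys and values are not cached directly. Each token's hidden state h_t is first down-projected into a small latent vector, and the keys and values are re-expanded from that latent whenever they are needed:

    c_t = W_down · h_t                      (latent vector of dimension d_c, much smaller than n_heads × d_head)
    k_t = W_upK · c_t,    v_t = W_upV · c_t

Only the latent c_t (plus a small decoupled positional key used for RoPE) has to be stored in the cache. In DeepSeek-V2/V3 this comes to a few hundred values per token per layer instead of tens of thousands, which is where the large memory saving comes from.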

Of course, besides MLA and MoE, DeepSeek employs other techniques to significantly reduce training and inference costs, including but not limited to low-precision training, auxiliary-loss-free load balancing strategies, and Multi-Token Prediction (MTP).

Performance and Advantages

Performance data indicates that FlashMLA far outperforms traditional approaches in both memory-bound and compute-bound settings, thanks to MLA's compressed KV cache and kernel-level optimizations for Hopper GPUs. Comparisons with standard multi-head attention further highlight FlashMLA's advantages.

Key Application Scenarios for FlashMLA include:

  • Long Sequence Processing: Suitable for handling texts with thousands of tokens, such as document analysis or extended dialogues.
  • Real-time Applications: Such as chatbots, virtual assistants, and real-time translation systems, reducing latency.
  • Resource Efficiency: Reduces memory and computational demands, facilitating deployment on edge devices.

Ecosystem Impact and Open Source Significance

Currently, AI training and inference rely primarily on NVIDIA H100/H800 GPUs, but the software ecosystem around them is still maturing. Now that FlashMLA is open-sourced, it can potentially be integrated into ecosystems such as vLLM (an efficient LLM inference framework), Hugging Face Transformers, or Llama.cpp (lightweight LLM inference), enabling open-source large language models (such as LLaMA, Mistral, and Falcon) to run more efficiently: the same resources accomplish more work while saving costs.

Because FlashMLA achieves high computational throughput (up to 580 TFLOPS in compute-bound workloads) and strong memory-bandwidth utilization (up to 3000 GB/s in memory-bound workloads), with both figures reported on the H800, the same GPU resources can handle more requests, lowering the cost per inference. For AI companies and cloud service providers, adopting FlashMLA translates to lower costs and faster inference, directly benefiting AI companies, academic institutions, and enterprise users by improving GPU resource utilization.

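As a rough sanity check on what the memory-bandwidth figure means (assuming the commonly quoted ~3.35 TB/s peak HBM3 bandwidth of the H800 SXM; treat that peak figure as an assumption rather than a number from the FlashMLA project):

    3000 GB/s ÷ ~3350 GB/s ≈ 90% of peak memory bandwidth

Decode-time attention is typically memory-bound, so running this close to the hardware's bandwidth ceiling is precisely what turns into more tokens served per GPU.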

Furthermore, researchers and developers can build upon FlashMLA for further optimization. In the past, these efficient AI inference optimization technologies were typically held by giants like OpenAI and NVIDIA. Now, with FlashMLA open-sourced, smaller AI companies or independent developers can also leverage this technology. As more people enter the AI field to innovate, we can expect a surge in new AI startup projects.

In short, if you are an AI practitioner or developer currently using H100/H800 GPUs for LLM training or inference, FlashMLA is likely a project worth your attention or research.

Technical Deep Dive: The PTX Connection

Much as netizens dug up the PTX details mentioned in the DeepSeek V3 paper during the Spring Festival, users on X have discovered that the FlashMLA project released by DeepSeek also contains a line of inline PTX code. PTX (Parallel Thread Execution) is an intermediate instruction set architecture for the CUDA platform, sitting between high-level GPU programming languages and low-level machine code, and it is often considered one of NVIDIA's technical moats.

Using inline PTX gives developers finer-grained control over the GPU's execution flow, which can translate into more efficient computation. In addition, tapping the underlying capabilities of NVIDIA GPUs directly, rather than relying entirely on the high-level CUDA programming model, may help erode the advantage that NVIDIA's technical barriers confer in GPU programming. In other words, this might suggest that DeepSeek is deliberately working around NVIDIA's relatively closed ecosystem.

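To illustrate the mechanism, here is a deliberately trivial CUDA C++ kernel that uses inline PTX; this is a generic toy example, not the actual line found in the FlashMLA source. An asm statement embeds a PTX instruction directly in the kernel instead of letting the compiler choose the instruction:

    #include <cstdio>

    __global__ void add_one(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float in = data[i];
            float out;
            // Inline PTX: issue add.f32 directly rather than writing `out = in + 1.0f`
            // and leaving the instruction choice to the compiler.
            asm volatile("add.f32 %0, %1, %2;" : "=f"(out) : "f"(in), "f"(1.0f));
            data[i] = out;
        }
    }

    int main() {
        const int n = 8;
        float host[n] = {0, 1, 2, 3, 4, 5, 6, 7};
        float* dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        add_one<<<1, n>>>(dev, n);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%.1f ", host[i]);  // expect 1.0 ... 8.0
        printf("\n");
        cudaFree(dev);
        return 0;
    }

In real kernels, this technique is used for instructions or scheduling hints that standard CUDA C++ does not expose, which is where the "finer control over execution" described above comes from.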

(Note: The original content included informal commentary about upcoming model releases and a deployment guide. The professional blog post focuses on the core technical explanation and impact. The deployment instructions and recruitment notice from the original are omitted for conciseness and to maintain a professional tone.)
