
FlashMLA: DeepSeek's open-source, efficient MLA decoding kernel, optimized for NVIDIA Hopper GPUs
BLUF: FlashMLA is an open-source, high-performance Multi-head Latent Attention (MLA) decoding kernel from DeepSeek, optimized for NVIDIA Hopper architecture GPUs and designed to handle variable-length sequences efficiently. It improves memory and compute efficiency through an optimized KV cache and BF16 support, reaching up to 3000 GB/s of memory bandwidth in memory-bound configurations and 580 TFLOPS in compute-bound configurations on H800 SXM5 GPUs. FlashMLA suits large language model (LLM) inference and other natural language processing (NLP) workloads that need efficient decoding.
DeepSeek, 2026/1/23






