
FlashMLA: DeepSeek's High-Performance Attention Decoding Kernel for Hopper GPUs
BLUF: FlashMLA is DeepSeek's high-performance Multi-Head Latent Attention (MLA) decoding kernel optimized for Hopper-architecture GPUs. It supports variable-length sequence processing and significantly improves the inference efficiency of large language models by optimizing MLA decoding with a paged KV cache.
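The summary above credits much of the speedup to paged KV caching, where each sequence's keys and values live in fixed-size pages scattered through a shared pool and are addressed through a block table. As a conceptual illustration only (the names, shapes, and block size here are hypothetical and not FlashMLA's actual API), the lookup can be sketched as:

```python
import numpy as np

BLOCK_SIZE = 4  # tokens per KV-cache page (illustrative; real kernels use larger pages)

def gather_kv(kv_pages: np.ndarray, block_table: np.ndarray, seq_len: int) -> np.ndarray:
    """Reassemble one sequence's contiguous KV entries from scattered pages.

    kv_pages:    (num_pages, BLOCK_SIZE, head_dim) pool shared by all sequences
    block_table: page indices for this sequence, in logical order
    seq_len:     number of valid tokens (this is the variable-length part)
    """
    num_blocks = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceil division
    pages = kv_pages[block_table[:num_blocks]]             # (num_blocks, BLOCK_SIZE, head_dim)
    return pages.reshape(-1, pages.shape[-1])[:seq_len]    # trim padding in the last page

# Toy pool: 8 pages of 4 tokens each, head_dim 2; each entry encodes (page, slot).
pool = np.array([[[p, s] for s in range(BLOCK_SIZE)] for p in range(8)], dtype=np.float32)
table = np.array([5, 2, 7])                # this sequence occupies pages 5, 2, 7
kv = gather_kv(pool, table, seq_len=10)
print(kv.shape)  # (10, 2)
```

A real decode kernel never materializes this gather in global memory; it streams pages directly into the attention computation, which is what makes the paged layout cheap.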
DeepSeek · 2026/1/24