
大型语言模型如何加速GPU内核优化?从研究到生产的技术路线 (How Large Language Models Accelerate GPU Kernel Optimization: A Technical Roadmap from Research to Production)

2026/4/18

AI Summary (BLUF)

This guide explores how Large Language Models (LLMs) can accelerate GPU kernel optimization, bridging the gap between research algorithms and production-ready implementations. It covers the computational foundations of matrix multiplication, the mathematical framework from Strassen to tensor decomposition, and DeepMind's evolution from AlphaTensor to AlphaEvolve, providing a technical roadmap for self-improving AI infrastructure.

原文翻译: 本指南探讨了大型语言模型(LLM)如何加速GPU内核优化,弥合研究算法与生产就绪实现之间的差距。它涵盖了矩阵乘法的计算基础、从Strassen到张量分解的数学框架,以及DeepMind从AlphaTensor到AlphaEvolve的演进历程,为自改进AI基础设施提供了技术路线图。

A technical discussion of how large language models can speed up GPU kernel optimization, from research ideas to production kernels.

关于大型语言模型如何加速GPU内核优化的技术探讨:从研究构想到生产级内核。

Some of the ideas in this article were tested in public: an earlier version placed as one of the four winning projects in the GPU Mode hackathon, with follow-on work explored during the xAI hackathon.

本文中的一些想法已在公开场合得到验证:一个早期版本在GPU Mode黑客松中成为四个获奖项目之一,后续工作也在xAI黑客松期间进行了探索。


Note: References to AlphaEvolve below refer to OpenEvolve, the open-source implementation of the AlphaEvolve paper: https://github.com/algorithmicsuperintelligence/openevolve

The underlying ideas and algorithms are described in Google DeepMind's AlphaEvolve paper. The implementation and supporting tooling, including the visualization dashboard and evaluation with a weighted council of LLMs, are provided by OpenEvolve. Thanks to the OpenEvolve authors for their work.

注意:下文提到的AlphaEvolve指的是OpenEvolve,即AlphaEvolve论文的开源实现:https://github.com/algorithmicsuperintelligence/openevolve

其核心思想和算法在Google DeepMind的AlphaEvolve论文中有所描述。具体的实现和支持工具,包括可视化仪表盘和由加权LLM委员会进行的评估,均由OpenEvolve提供。感谢OpenEvolve作者的工作。

1. 引言:内核优化危机 (Introduction: The Kernel Optimization Crisis)

Modern ML workloads spend much of their time in matrix multiplication. Transformer attention, convolutions, and gradient updates reduce to matrix operations executed on specialized hardware. Their efficiency determines training cost and inference latency.

现代机器学习工作负载的大部分时间都花在矩阵乘法上。Transformer注意力机制、卷积和梯度更新最终都归结为在专用硬件上执行的矩阵运算。它们的效率决定了训练成本和推理延迟。

In practice, the gap between published algorithms and production kernels is often months to years.

在实践中,已发表的算法与生产级内核之间的差距往往长达数月甚至数年。

Timeline: Paper to Production

Consider Flash Attention, published in May 2022. It reduced attention’s memory complexity from O(n²) to O(n). It took over 12 months before it was widely usable in production frameworks. Turning a paper into an optimized and deployable implementation typically requires:

以2022年5月发布的Flash Attention为例。它将注意力机制的内存复杂度从O(n²)降低到了O(n)。但直到超过12个月后,它才在生产框架中被广泛使用。将一篇论文转化为一个优化且可部署的实现通常需要:

  1. Deep understanding of the algorithm’s mathematical properties (对算法数学特性的深刻理解)
  2. Hardware expertise spanning memory hierarchies, instruction sets, and parallelism models (涵盖内存层次结构、指令集和并行模型的硬件专业知识)
  3. Systems engineering for integration with existing frameworks (与现有框架集成的系统工程)
  4. Exhaustive tuning across different input shapes and hardware configurations (针对不同输入形状和硬件配置的详尽调优)

Many proposed improvements do not survive this process.

许多改进无法通过这一过程。

The Human Bottleneck Funnel

Hundreds of optimization papers appear each year. Some are implemented. Fewer are optimized and maintained in production. For kernels, the work is specialized and slow.

每年有数百篇优化论文发表。其中一些被实现。更少的部分被优化并维护在生产环境中。对于内核而言,这项工作既专业又缓慢。

1.1 搜索空间问题 (The Search Space Problem)

Kernel optimization has a large configuration space. For a single GPU matrix multiplication kernel, the space includes:

内核优化拥有巨大的配置空间。对于一个GPU矩阵乘法内核,其空间包括:

  • Tile sizes: How to partition the computation (e.g., 64×64, 128×128, 256×64) (分块大小:如何划分计算,例如 64×64, 128×128, 256×64)
  • Block dimensions: Thread block organization (e.g., 128, 256, 512 threads) (块维度:线程块的组织方式,例如 128, 256, 512 个线程)
  • Memory staging: Shared memory allocation and access patterns (内存分级:共享内存分配和访问模式)
  • Loop ordering: Which dimensions to iterate first (循环顺序:首先迭代哪个维度)
  • Vectorization: SIMD width and memory coalescing (向量化:SIMD宽度和内存合并)
  • Pipelining stages: Overlapping computation with memory transfers (流水线阶段:计算与内存传输的重叠)

Search Space Explosion

Each parameter has multiple valid values. The product is often millions or billions of configurations. Most are slow; a few are good for specific shapes. Exhaustive search is infeasible. Random search wastes evaluations. Manual tuning does not scale.

每个参数都有多个有效值。其组合通常是数百万甚至数十亿种配置。其中大多数配置速度很慢;只有少数对特定形状表现良好。穷举搜索不可行。随机搜索浪费评估资源。手动调优无法扩展。
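As a rough illustration, multiplying out a handful of hypothetical per-knob choice counts (these numbers are illustrative, not taken from any real autotuner) already yields tens of thousands of configurations:

作为一个粗略的示例,将几个假设的每参数候选值数量相乘(这些数字仅为示意,并非来自任何真实的自动调优器),就已经产生数万种配置:

```python
from math import prod

# Hypothetical numbers of candidate values per tuning knob.
# Real autotuner menus are comparable or larger.
choices = {
    "tile_size": 12,        # e.g., 32x32 ... 256x128
    "block_threads": 4,     # e.g., 128, 256, 512, 1024
    "memory_staging": 6,    # shared-memory layouts / swizzles
    "loop_order": 6,        # permutations of the M, N, K loops
    "vector_width": 4,      # e.g., 1, 2, 4, 8 elements per load
    "pipeline_stages": 5,   # e.g., 1..5 software pipelining stages
}

total = prod(choices.values())
print(total)  # 34560 raw combinations from just six knobs
```

Add a few more knobs, or shape-dependent tuning, and the product quickly reaches the millions-to-billions range mentioned above.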

1.2 愿景:自我改进的AI基础设施 (The Vision: Self-Improving AI Infrastructure)

If a system can improve its own kernels, training time drops. Faster training makes more evaluation budget available for further optimization.

如果一个系统能够改进其自身的内核,训练时间就会缩短。更快的训练能为进一步的优化提供更多的评估预算。

Self-Improvement Loop

In 2025, DeepMind’s AlphaEvolve reported Gemini optimizing parts of its training pipeline, including kernel heuristics, and reducing overall training time by 1%.

2025年,DeepMind的AlphaEvolve报告称,Gemini优化了其训练管道的部分内容,包括内核启发式算法,并将整体训练时间减少了1%。

This document describes the technical basis for this approach and the mechanisms used in these systems.

本文档描述了这种方法的技术基础以及这些系统中使用的机制。


2. 计算基础 (The Computational Foundation)

This section describes the computation being optimized. The primary primitive is matrix multiplication.

本节描述了被优化的计算。主要的原语是矩阵乘法。

2.1 矩阵乘法:基础 (Matrix Multiplication: The Basics)

Multiplying an M×K matrix A by a K×N matrix B produces an M×N matrix C:

将一个 M×K 的矩阵 A 与一个 K×N 的矩阵 B 相乘,得到一个 M×N 的矩阵 C:

C[i,j] = Σₖ A[i,k] × B[k,j]

Each output element requires K multiplications and K-1 additions. The full operation uses M×N×K multiplications. For n×n matrices, the complexity is O(n³), so matrix multiplication dominates training and inference cost.

每个输出元素需要 K 次乘法和 K-1 次加法。整个操作需要 M×N×K 次乘法。对于 n×n 矩阵,复杂度为 O(n³),因此矩阵乘法主导了训练和推理的成本。
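The definition above translates directly into a triple loop. A minimal pure-Python sketch (illustrative, not a production kernel):

上述定义可以直接转化为三重循环。下面是一个最小的纯Python示意(仅作说明,并非生产级内核):

```python
def matmul(A, B):
    """Naive M×K by K×N multiply: C[i][j] = sum_k A[i][k] * B[k][j]."""
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):   # K multiply-adds per output element
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

The three nested loops make the M×N×K multiplication count visible: every reordering, tiling, or vectorization of a GEMM kernel is a rearrangement of exactly this iteration space.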

3D Matrix Multiplication Visualization

The computation can be viewed as a 3D grid: each output element is a dot product, represented as a line through the input matrices.

该计算可以看作一个三维网格:每个输出元素是一个点积,表示为穿过输入矩阵的一条线。

2.2 分块为何重要 (Why Tiling Matters)

Modern GPUs have a memory hierarchy:

现代GPU具有内存层次结构:

  • Registers: Fastest, tiny (kilobytes per SM) (寄存器:最快,容量极小,每个流式多处理器几KB)
  • Shared Memory/L1 Cache: Fast, small (tens of KB per SM) (共享内存/L1缓存:快速,容量小,每个流式多处理器几十KB)
  • L2 Cache: Medium speed, medium size (megabytes) (L2缓存:中等速度,中等容量,几MB)
  • Global Memory (HBM): Slow, large (tens of GB) (全局内存 (HBM):慢速,容量大,几十GB)

The bandwidth gap between levels is 10-100×. A kernel that repeatedly fetches from global memory is memory-bound. Tiling reduces global memory traffic by increasing reuse in fast memory.

各级之间的带宽差距达10-100倍。一个反复从全局内存获取数据的核函数会受到内存带宽的限制。分块通过增加在快速内存中的重用,减少了全局内存的流量。
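A back-of-the-envelope model makes the traffic reduction concrete: with T×T output tiles, each element of A and B is fetched from global memory roughly 1/T as often. The function below is a simplification that ignores boundary effects and caching:

一个粗略的模型可以具体说明流量的减少:使用T×T的输出分块时,A和B的每个元素从全局内存被读取的次数大约减少为原来的1/T。下面的函数是一个忽略边界效应和缓存的简化模型:

```python
def global_reads(M, N, K, T):
    """Approximate global-memory element reads for a tiled GEMM.

    Each tile of A is read once per tile-column of C (N/T times),
    each tile of B once per tile-row of C (M/T times). T=1 models
    the untiled kernel. Boundary effects and caches are ignored.
    """
    return M * K * (N // T) + K * N * (M // T)

M = N = K = 4096
print(global_reads(M, N, K, 1))    # untiled: ~137 billion element reads
print(global_reads(M, N, K, 128))  # 128×128 tiles: 128× fewer
```

Under this model the compute stays fixed at M×N×K multiply-adds while the memory traffic drops by a factor of T, which is exactly how tiling moves a kernel from memory-bound toward compute-bound.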

The Matrix Tiling Problem

Tiling divides the large matrix multiplication into smaller sub-problems that fit in fast memory:

分块将大型矩阵乘法分解为适合快速内存的较小子问题:

  1. Load a tile of A into shared memory (将A的一个分块加载到共享内存中)
  2. Load a tile of B into shared memory (将B的一个分块加载到共享内存中)
  3. Compute the partial result for the C tile (计算C分块的部分结果)
  4. Accumulate and repeat (累加并重复)

The key choice is the tile size.

关键在于选择分块大小。

  • Too small: Overhead from many tiles, underutilized compute units, insufficient data reuse (太小:大量分块导致开销,计算单元利用不足,数据重用不充分)
  • Too large: Tiles don’t fit in shared memory, cause register spills, reduce occupancy (太大:分块无法放入共享内存,导致寄存器溢出,降低占用率)
  • Just right: Maximizes data reuse while fitting in fast memory and maintaining high occupancy (恰到好处:在适应快速内存并保持高占用率的同时,最大化数据重用)

The optimal tile size depends on:

最优分块大小取决于:

  • Matrix dimensions (M, N, K) (矩阵维度 (M, N, K))
  • Data type (fp32, fp16, bf16, int8) (数据类型 (fp32, fp16, bf16, int8))
  • Hardware capabilities (shared memory size, register count, tensor core availability) (硬件能力 (共享内存大小、寄存器数量、张量核心可用性))
  • Memory access patterns (contiguous vs. strided) (内存访问模式 (连续 vs. 跨步))

This motivates autotuning and explains its cost.

这促使了自动调优的需求,也解释了其成本。
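The load/compute/accumulate loop described above can be sketched in pure Python, with the k0 slab standing in for a shared-memory staging step (illustrative only; assumes all dimensions are divisible by T):

上面描述的加载/计算/累加循环可以用纯Python示意,其中k0分块代表共享内存的分级步骤(仅作说明;假设所有维度都能被T整除):

```python
def tiled_matmul(A, B, T):
    """Tiled matmul: walk the output in T×T tiles and the K dimension
    in T-wide slabs, mimicking the load-tile / compute / accumulate
    loop of a GPU kernel. Assumes dimensions are multiples of T."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            for k0 in range(0, K, T):
                # One K-slab: in a real kernel, tiles of A and B would
                # be staged into shared memory here.
                for i in range(i0, i0 + T):
                    for j in range(j0, j0 + T):
                        s = 0
                        for k in range(k0, k0 + T):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s  # accumulate the partial result
    return C

A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j for j in range(4)] for i in range(4)]
print(tiled_matmul(A, B, 2))
```

In Python the tiling changes nothing about performance; on a GPU, the same loop restructuring is what turns one pass over slow global memory into many passes over fast shared memory.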

2.3 操作成本 (The Cost of Operations)

The relative cost of operations constrains algorithm design:

操作的相对成本限制了算法设计:

CPU Operation Latency

On modern CPUs, a 64-bit integer multiplication takes approximately 20 cycles. An addition? Just 1 cycle. That’s a 20× difference.

在现代CPU上,一次64位整数乘法大约需要20个周期。加法呢?只需1个周期。这是20倍的差异。

An astute reader, Benoit Jacob, the author of Eigen (https://libeigen.gitlab.io/#credits), who knows a thing or two about linear algebra on CPUs, pointed out that comparing raw instruction latencies is a somewhat naive approach. In practice, heavily optimized numerical CPU code can narrow the gap to under 5×. Furthermore, using 64-bit integers skews the argument heavily in favor of this example: with a smaller data type like f32, the difference in instruction cycle counts is much smaller, sometimes negligible.

That said, the precise ratio matters less than the underlying principle. For the purpose of motivating the rest of this discussion, we'll continue with the 20× assumption.

一位敏锐的读者,Eigen库的作者Benoit Jacob(https://libeigen.gitlab.io/#credits),对CPU上的线性代数颇有了解,他指出我们这里通过比较原始指令延迟的方法有些天真。实际上,经过高度优化的数值CPU代码可以将差距缩小到5倍以下。此外,使用64位整数极大地偏向了这个例子:对于像f32这样更小的数据类型,指令周期数的差异要小得多,有时甚至可以忽略不计。

尽管如此,精确的比例不如其背后的原理重要。为了推动本文后续的讨论,我们将继续使用20倍的假设。

If the same result can be computed with fewer multiplications, additional additions may be acceptable.

如果能用更少的乘法计算出相同的结果,那么额外的加法可能是可以接受的。

Consider computing a² - b²:

考虑计算 a² - b²:

Method 1: a×a - b×b (方法1:a×a - b×b)

  • 2 multiplications + 1 addition = 2×20 + 1 = 41 cycles (2次乘法 + 1次加法 = 2×20 + 1 = 41个周期)

Method 2: (a+b) × (a-b) (方法2:(a+b) × (a-b))

  • 1 multiplication + 2 additions = 1×20 + 2 = 22 cycles (1次乘法 + 2次加法 = 1×20 + 2 = 22个周期)

Same result, 46% faster. This is a standard trade: reduce expensive operations at the cost of cheaper ones.

结果相同,速度提升46%。这是一种标准的权衡:以增加廉价操作为代价来减少昂贵操作。
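Under the 20-cycle assumption above, the trade is easy to make concrete (the cycle model is this article's simplification, not a measured number):

在上述20周期的假设下,这种权衡很容易具体化(该周期模型是本文的简化,并非实测数据):

```python
MUL_CYCLES, ADD_CYCLES = 20, 1   # assumed latencies from the discussion above

def cost(muls, adds):
    """Total cycles under the simplified latency model."""
    return muls * MUL_CYCLES + adds * ADD_CYCLES

a, b = 7, 3
method1 = a * a - b * b          # 2 multiplications + 1 add-class op
method2 = (a + b) * (a - b)      # 1 multiplication + 2 add-class ops
assert method1 == method2 == 40  # same result either way

print(cost(2, 1), cost(1, 2))    # 41 vs 22 cycles under the model
```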


3. 从Strassen到张量:数学框架 (From Strassen to Tensors: The Mathematical Framework)

3.1 Strassen的突破 (The Strassen Breakthrough)

In 1969, Volker Strassen showed that 2×2 matrices can be multiplied with fewer than 8 multiplications.

1969年,Volker Strassen证明了2×2矩阵的乘法可以用少于8次乘法完成。

Standard 2×2 matrix multiplication:

标准的2×2矩阵乘法:

[c₁ c₂]   [a₁ a₂]   [b₁ b₂]
[c₃ c₄] = [a₃ a₄] × [b₃ b₄]

c₁ = a₁b₁ + a₂b₃
c₂ = a₁b₂ + a₂b₄
c₃ = a₃b₁ + a₄b₃
c₄ = a₃b₂ + a₄b₄

That’s 8 multiplications and 4 additions. Total: 8×20 + 4×1 = 164 cycles.

这是8次乘法和4次加法。总计:8×20 + 4×1 = 164个周期

Strassen found a way to do it with 7 multiplications:

Strassen找到了一种用7次乘法完成的方法:

m₁ = (a₁ + a₄)(b₁ + b₄)
m₂ = (a₃ + a₄)b₁
m₃ = a₁(b₂ - b₄)
m₄ = a₄(b₃ - b₁)
m₅ = (a₁ + a₂)b₄
m₆ = (a₃ - a₁)(b₁ + b₂)
m₇ = (a₂ - a₄)(b₃ + b₄)

c₁ = m₁ + m₄ - m₅ + m₇
c₂ = m₃ + m₅
c₃ = m₂ + m₄
c₄ = m₁ - m₂ + m₃ + m₆

That’s 7 multiplications and 18 additions. Total: 7×20 + 18×1 = 158 cycles.

这是7次乘法和18次加法。总计:7×20 + 18×1 = 158个周期

The savings for 2×2 is small, but Strassen’s algorithm is recursive. Applied to blocks, the complexity drops from O(n³) to O(n^log₂7) ≈ O(n^2.807).

对于2×2矩阵,节省的运算量很小,但Strassen算法是递归的。应用于分块时,复杂度从O(n³)下降到O(n^log₂7) ≈ O(n^2.807)。
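The seven products can be verified directly against the standard formula; a pure-Python check for the scalar 2×2 case:

这七个乘积可以直接与标准公式进行验证;以下是标量2×2情形的纯Python检验:

```python
def strassen_2x2(a1, a2, a3, a4, b1, b2, b3, b4):
    """2×2 multiply with 7 multiplications (Strassen, 1969)."""
    m1 = (a1 + a4) * (b1 + b4)
    m2 = (a3 + a4) * b1
    m3 = a1 * (b2 - b4)
    m4 = a4 * (b3 - b1)
    m5 = (a1 + a2) * b4
    m6 = (a3 - a1) * (b1 + b2)
    m7 = (a2 - a4) * (b3 + b4)
    return (m1 + m4 - m5 + m7,  # c1
            m3 + m5,            # c2
            m2 + m4,            # c3
            m1 - m2 + m3 + m6)  # c4

def naive_2x2(a1, a2, a3, a4, b1, b2, b3, b4):
    """Standard 8-multiplication 2×2 multiply, for comparison."""
    return (a1*b1 + a2*b3, a1*b2 + a2*b4,
            a3*b1 + a4*b3, a3*b2 + a4*b4)

args = (1, 2, 3, 4, 5, 6, 7, 8)
print(strassen_2x2(*args) == naive_2x2(*args))  # True
```

Because the identities hold for any ring elements, the same seven-product scheme applies unchanged when a₁…b₄ are matrix blocks rather than scalars, which is what makes the recursion work.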

Benoit Jacob also offered a cleaner way to see why Strassen’s algorithm reduces asymptotic complexity despite introducing more additions. When you apply Strassen to a large 2N×2N matrix, you treat it as a 2×2 block matrix whose “scalars” are N×N submatrices. Addition of two N×N matrices is O(N²), you just walk through the entries. Multiplication of two N×N matrices is strictly more expensive than O(N²), regardless of which multiplication algorithm you use. So every time you trade one block multiplication for any bounded number of block additions, you’re replacing a superquadratic operation with quadratic ones. The additions become asymptotically free. That’s why 7 multiplications with 18 additions beats 8 multiplications with 4 additions, not because of instruction-level cycle counts, but because at scale, the multiplications dominate and each one you eliminate saves an entire recursive subproblem.

Benoit Jacob还提供了一种更清晰的方式来理解为什么Strassen算法尽管引入了更多加法,却降低了渐近复杂度。当你将Strassen算法应用于一个大的2N×2N矩阵时,你将其视为一个2×2的分块矩阵,其“标量”是N×N的子矩阵。两个N×N矩阵的加法是O(N²),你只需遍历元素。两个N×N矩阵的乘法严格比O(N²)更昂贵,无论你使用哪种乘法算法。因此,每次你用一次分块乘法交换任何有限次数的分块加法时,你都是在用二次操作替换超二次操作。加法在渐近意义上变得免费。这就是为什么7次乘法加18次加法优于8次乘法加4次加法,不是因为指令级的周期数,而是在大规模下,乘法占主导地位,而你消除的每一次乘法都节省了整个递归子问题。
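The recursive payoff can be counted directly: halving k times yields 7^k block multiplications instead of 8^k, and for n = 2^k the exponent works out to log₂7. A quick, illustrative count:

递归的收益可以直接计数:对半分k次会产生7^k次而非8^k次分块乘法,当n = 2^k时,指数正好是log₂7。一个简单的示意性计数:

```python
import math

def mults(n, strassen=True):
    """Count scalar multiplications for an n×n product (n a power of 2),
    recursing all the way down to 1×1 blocks."""
    if n == 1:
        return 1
    branch = 7 if strassen else 8  # Strassen trades 1 mult for extra adds
    return branch * mults(n // 2, strassen=strassen)

n = 1024                                     # 2^10: ten recursion levels
print(mults(n, strassen=False))              # 8^10 = n^3 = 1073741824
print(mults(n, strassen=True))               # 7^10 = 282475249, ~3.8× fewer
print(math.log(mults(n, strassen=True), n))  # ≈ 2.807 ≈ log2(7)
```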

Strassen Asymptotic Insight

Pushing the exponent below log₂7 requires fundamentally different multiplication algorithms.

将指数推低到log₂7以下需要本质上不同的乘法算法。

