Optimizing Memory Bottlenecks and Compute Efficiency in LLM Inference (with KV Caching and TensorRT-LLM)
AI Summary (BLUF)
This article explores the most pressing challenges in LLM inference, such as memory bottlenecks and computational inefficiencies, and provides practical solutions including KV caching, batching strategies, and model parallelization techniques using tools like TensorRT-LLM and NVIDIA frameworks.
Large models built by stacking transformer layers achieve higher accuracy, stronger few-shot learning capabilities, and even near-human emergent abilities across a wide range of natural language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference, a recurring cost. The most popular large language models today reach tens to hundreds of billions of parameters and, depending on the use case, may need to ingest long inputs (or contexts), which also adds expense. For example, retrieval-augmented generation (RAG) pipelines require putting large amounts of information into the input of the model, greatly increasing the amount of processing work the LLM has to do.
This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of transformer architecture and the attention mechanism in general.
Developers can also explore these inference optimization techniques using open NVIDIA and community models, such as the Nemotron 3 family of open models, running on the open-source TensorRT-LLM library, available on GitHub. This makes it possible to experiment with real-world inference tradeoffs using production-grade code rather than abstract examples.
Understanding LLM inference
Most of the popular decoder-only LLMs (GPT-3, for example) are pretrained on the causal modeling objective, essentially as next-word predictors. These LLMs take a series of tokens as input and generate subsequent tokens autoregressively until they meet a stopping criterion (a limit on the number of tokens to generate or a list of stop words, for example) or until they generate a special <end> token marking the end of generation. This process involves two phases: the prefill phase and the decode phase.
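The two-phase generation loop above can be sketched in a few lines of Python. This is a toy illustration, not a real runtime: `model.next_token` is a hypothetical interface standing in for a full forward pass over the sequence.

```python
def generate(model, input_ids, max_new_tokens, eos_id):
    """Greedy autoregressive generation: prefill once, then decode token by token."""
    tokens = list(input_ids)  # the prefill phase processes this prompt in one pass
    for _ in range(max_new_tokens):
        # Decode phase: each step conditions on all tokens produced so far.
        next_id = model.next_token(tokens)  # hypothetical stand-in for a forward pass
        tokens.append(next_id)
        if next_id == eos_id:  # the special <end> token stops generation
            break
    return tokens
```

Generation stops at whichever comes first: the token budget or the end-of-sequence token, mirroring the stopping criteria described above.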
Note that tokens are the atomic units of language that a model processes. One token is approximately four English characters. All natural language inputs are converted to tokens before being fed into the model.
Prefill phase: processing the input
In the prefill phase, the LLM processes the input tokens to compute the intermediate states (keys and values), which are used to generate the “first” new token. Each new token depends on all the previous tokens, but because the full extent of the input is known, at a high level this is a matrix-matrix operation that’s highly parallelized. It effectively saturates GPU utilization.
Decode phase: generating the output
In the decode phase, the LLM generates output tokens autoregressively one at a time, until a stopping criterion is met. Each sequential output token needs to know all the previous iterations' output states (keys and values). This is like a matrix-vector operation that underutilizes the GPU's compute capability compared to the prefill phase. The speed at which the data (weights, keys, values, activations) is transferred to the GPU from memory dominates the latency, not how fast the computation actually happens. In other words, this is a memory-bound operation.
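One way to see why decode is memory-bound is to compare its arithmetic intensity (FLOPs per byte moved) with what a GPU can sustain. The sketch below is a back-of-envelope estimate under stated assumptions: weight traffic dominates memory movement, and each weight contributes one multiply-add per sequence per step. Real kernels differ in the details.

```python
def decode_step_intensity(n_params, batch_size=1, bytes_per_param=2):
    """Rough FLOPs-per-byte for one decode step, assuming weight reads dominate."""
    flops = 2 * n_params * batch_size         # one multiply-add per weight per sequence
    bytes_moved = n_params * bytes_per_param  # each FP16 weight read once per step
    return flops / bytes_moved

# A 7B model at batch size 1 lands near 1 FLOP/byte, while modern GPUs need on
# the order of 100+ FLOPs/byte to stay compute-bound, hence memory-bound decode.
# Note the intensity grows with batch size, which is one motivation for batching.
```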
Many of the inference challenges and corresponding solutions featured in this post concern the optimization of this decode phase: efficient attention modules, managing the keys and values effectively, and others.
Different LLMs may use different tokenizers, and thus, comparing output tokens between them may not be straightforward. When comparing inference throughput, even if two LLMs have similar tokens per second output, they may not be equivalent if they use different tokenizers. This is because corresponding tokens may represent a different number of characters.
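A tiny illustration of the tokenizer caveat, using two made-up tokenizers (character-level and whitespace-based) rather than real ones:

```python
text = "Different tokenizers split the same text differently."
char_tokens = list(text)    # hypothetical character-level tokenizer
word_tokens = text.split()  # hypothetical whitespace tokenizer

# Equal tokens/sec from two models using these tokenizers would not mean
# equal text throughput, because each token covers a different span of text.
chars_per_char_token = len(text) / len(char_tokens)  # exactly 1 character per token
chars_per_word_token = len(text) / len(word_tokens)  # several characters per token
```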
Batching
The simplest way to improve GPU utilization, and effectively throughput, is through batching. Since multiple requests use the same model, the memory cost of the weights is spread out. Transferring larger batches to the GPU to be processed all at once leverages more of the available compute.
Batch sizes, however, can only be increased up to a certain limit, at which point they may lead to a memory overflow. To better understand why this happens requires looking at key-value (KV) caching and LLM memory requirements.
Traditional batching (also called static batching) is suboptimal. This is because for each request in a batch, the LLM may generate a different number of completion tokens, and subsequently they have different execution times. As a result, all requests in the batch must wait until the longest request is finished, which can be exacerbated by a large variance in the generation lengths. There are methods to mitigate this, such as in-flight batching.
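The cost of static batching can be quantified with a small helper. This is an illustrative model only: it assumes every request decodes one token per step and ignores prefill.

```python
def static_batch_idle_fraction(gen_lengths):
    """Fraction of decode slots idle while short requests wait for the longest one."""
    steps = max(gen_lengths)  # the batch runs until the longest request finishes
    used = sum(gen_lengths)   # slots that do useful work
    return 1 - used / (steps * len(gen_lengths))

# Uniform lengths waste nothing; high variance wastes most of the batch:
# static_batch_idle_fraction([100, 100, 100, 100]) -> 0.0
# static_batch_idle_fraction([10, 400, 20, 30])    -> 0.7125
```

In-flight batching avoids this waste by evicting finished requests and admitting new ones mid-batch, rather than waiting for the whole batch to drain.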
For example, open-source runtimes like NVIDIA TensorRT-LLM provide in-flight batching and related scheduling optimizations for popular open models (for example, Llama and Nemotron). This does not require writing custom schedulers or CUDA kernels.
Key-value caching
One common optimization for the decode phase is KV caching. The decode phase generates a single token at each time step, but each token depends on the key and value tensors of all previous tokens (including the input tokens’ KV tensors computed at prefill, and any new KV tensors computed until the current time step).
To avoid recomputing all these tensors for all tokens at each time step, it’s possible to cache them in GPU memory. Every iteration, when new elements are computed, they are simply added to the running cache to be used in the next iteration. In some implementations, there is one KV cache for each layer of the model.
Figure 1. An illustration of the key-value caching mechanism
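The mechanism in Figure 1 can be sketched with a minimal per-layer cache. Plain Python lists stand in for GPU tensors here; nothing below is a real framework API.

```python
class KVCache:
    """One running key/value cache per transformer layer."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k, v):
        # Each decode step adds one new (k, v) pair per layer instead of
        # recomputing the tensors for every previous token.
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def get(self, layer):
        # Attention at this layer reads the full cached history.
        return self.keys[layer], self.values[layer]
```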
LLM memory requirements
In effect, the two main contributors to the GPU LLM memory requirement are model weights and the KV cache.
- Model weights: Memory is occupied by the model parameters. As an example, a model with 7 billion parameters (such as Llama 2 7B), loaded in 16-bit precision (FP16 or BF16) would take roughly 7B * sizeof(FP16) ~= 14 GB in memory.
- KV caching: Memory is occupied by the caching of self-attention tensors to avoid redundant computation.
With batching, the KV cache of each of the requests in the batch must still be allocated separately, and can have a large memory footprint. The formula below delineates the size of the KV cache, applicable to most common LLM architectures today.
Size of KV cache per token in bytes = 2 * (num_layers) * (num_heads * dim_head) * precision_in_bytes
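The per-token formula translates directly into code. The example numbers below assume Llama 2 7B's published configuration (32 layers, 32 heads, head dimension 128):

```python
def kv_cache_bytes_per_token(num_layers, num_heads, dim_head, precision_bytes):
    # The leading 2 accounts for the K and V matrices;
    # num_heads * dim_head equals the model's hidden_size.
    return 2 * num_layers * num_heads * dim_head * precision_bytes

# Llama 2 7B in FP16: 2 * 32 * (32 * 128) * 2 = 524,288 bytes (~0.5 MiB per token)
```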
The first factor of 2 accounts for the K and V matrices. Commonly, the value of (num_heads * dim_head) is the same as the hidden_size (or dimension of the model, d_model) of the transformer. These model attributes are commonly found in model cards or associated config files.
This memory size is required for each token in the input sequence, across the batch of inputs. Assuming half-precision, the total size of KV cache is given by the formula below.
Total size of KV cache in bytes = (batch_size) * (sequence_length) * 2 * (num_layers) * (hidden_size) * sizeof(FP16)
For example, with a Llama 2 7B model in 16-bit precision and a batch size of 1, the size of the KV cache will be 1 * 4096 * 2 * 32 * 4096 * 2 bytes, which is ~2 GB.
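The total-size formula, checked against the Llama 2 7B example above:

```python
def kv_cache_total_bytes(batch_size, seq_length, num_layers, hidden_size,
                         precision_bytes=2):
    # precision_bytes defaults to 2 for half precision (sizeof(FP16)).
    return batch_size * seq_length * 2 * num_layers * hidden_size * precision_bytes

total = kv_cache_total_bytes(1, 4096, 32, 4096)  # 2,147,483,648 bytes = 2 GiB
# Doubling the batch size or the sequence length doubles the cache, which is
# why memory pressure mounts quickly with large batches and long contexts.
```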
Managing this KV cache efficiently is a challenging endeavor. Growing linearly with batch size and sequence length, the memory requirement can quickly scale. Consequently, it limits the throughput that can be served, and poses challenges for long-context inputs. This is the motivation behind several optimizations featured in this post.
Scaling up LLMs with model parallelization
One way to reduce the per-device memory footprint of the model weights is to distribute the model over several GPUs. Spreading the memory and compute footprint enables running larger models, or larger batches of inputs. Model parallelization is a necessity for training or inference on a model that requires more memory than is available on a single device, and for making training times and inference metrics (latency or throughput) suitable for certain use cases. There are several ways of parallelizing the model, based on how the model weights are split.
Note that data parallelism is a technique often mentioned in the same context as the others listed below. In data parallelism, the weights of the model are copied over multiple devices, and the (global) batch size of inputs is sharded across the devices into microbatches. It reduces the overall execution time by processing larger batches. However, it is a training-time optimization that is less relevant during inference.
Note that model-parallel techniques, including pipeline and tensor parallelism, are available in open frameworks such as NVIDIA Megatron-LM and the NVIDIA NeMo framework, which underpin training and inference workflows for a wide range of open models.
Pipeline parallelism
Pipeline parallelism involves sharding the model (vertically) into chunks, where each chunk comprises a subset of layers that is executed on a separate device. Figure 2a is an illustration of four-way pipeline parallelism, where the model is sequentially partitioned and a quarter subset of all layers are executed on each device. The outputs of a group of operations on one device are passed to the next, which continues executing the subsequent chunk. (F_n) and (B_n) indicate forward and backward passes respectively on device n. The memory requirement for storing model weights on each device is effectively quartered.
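A contiguous layer partition like the one in Figure 2a can be computed with a small helper. This is an illustrative sketch, not the API of any particular framework:

```python
def partition_layers(num_layers, num_devices):
    """Split layer indices into contiguous, near-equal chunks, one per device."""
    base, rem = divmod(num_layers, num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < rem else 0)  # spread any remainder over early devices
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# 32 layers over 4 devices: each device holds a contiguous quarter of the stack,
# so per-device weight memory is roughly quartered.
```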
The main limitation of this method is that, due to the sequential nature of the processing, some devices or layers may remain idle while waiting for the output (activations, gradients) of previous layers. This results in inefficiencies or “pipeline bubbles” in both the forward and backward passes. In Figure 2b, the white empty areas are the large pipeline bubbles with naive pipeline parallelism where devices are idle and underutilized.
Microbatching can mitigate this to some extent, as shown in Figure 2c. The global batch size of inputs is split into sub-batches, which are processed one by one, with gradients being accumulated at the end. Note that (F_{n,m}) and (B_{n,m}) indicate forward and backward passes respectively on device (n) with microbatch (m). This approach shrinks the size of pipeline bubbles, but it does not completely eliminate them.
Figure 2. Depiction of four-way pipeline parallelism. (a) The model is partitioned across its layers into 4 parts, each subset executed on a separate device. (b) Naive pipeline parallelism results in large pipeline bubbles and GPU under-utilization. (c) Micro-batching reduces the size of pipeline bubbles and improves GPU utilization.
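The benefit of microbatching in Figure 2 can be estimated with the standard GPipe-style bubble formula: with p pipeline stages and m microbatches, roughly (p - 1) / (m + p - 1) of the schedule is idle.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# Naive four-way pipeline (Figure 2b): a single batch behaves like one
# microbatch, leaving 75% of the schedule idle. With eight microbatches
# (Figure 2c) the bubbles shrink to ~27%, but they never reach zero.
```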
Frequently asked questions (FAQ)

How exactly does KV caching improve efficiency in LLM inference?
The KV cache stores the key and value states computed during prefill and at each decode step, so the decode phase does not recompute them for every previous token. This removes redundant computation at each step and significantly reduces latency.

What practical impact do batching strategies have on LLM inference performance?
Batching processes multiple requests at once, amortizing the memory cost of the model weights and turning many independent matrix-vector operations into more efficient matrix-matrix operations. This raises GPU utilization, increases overall throughput, and lowers the cost per unit of compute.

How does TensorRT-LLM help with model parallelization?
TensorRT-LLM provides a production-grade framework that supports model parallelization techniques such as pipeline and tensor parallelism, splitting very large models across multiple GPUs. This works around single-GPU memory limits and enables efficient large-scale LLM inference deployments.