Breaking the Limits: AirLLM Enables Lossless Inference of a 70B LLM on a 4GB GPU

2026/1/24
AI Summary (BLUF)

AirLLM introduces a memory optimization approach that enables running inference for a 70B-parameter large language model on a single 4GB GPU via layer-wise execution, Flash Attention, and model file sharding, without resorting to compression techniques such as quantization, distillation, or pruning that sacrifice model performance.

Introduction

Large Language Models (LLMs) require substantial GPU memory, raising the question: Is it feasible to run inference on a single GPU? If so, what is the minimum GPU memory required? A 70B-parameter LLM takes roughly 130GB to store. Merely loading it would typically require at least two 80GB A100 GPUs. During inference, the entire input sequence must also be held in memory for the "attention" computation, whose memory requirement scales quadratically with input length, demanding even more memory beyond the 130GB of weights.

So, what techniques can save such a massive amount of memory and enable inference on a single 4GB GPU? Crucially, these memory optimizations involve no model compression methods such as quantization, distillation, or pruning, which often sacrifice model performance. Today, we will delve into the key technologies for extreme memory optimization of large models. At the end of the article, we also share an open-source library that achieves this in just a few lines of code!

Key Technologies for Memory Optimization

1. Layered Inference

The most critical technique is Layered Inference. This is essentially the fundamental "divide and conquer" approach from computer science applied to model execution. Let's first examine the architecture of a large language model. Modern LLMs are built upon the Multi-head Self-Attention structure proposed in the seminal Google paper "Attention Is All You Need," commonly known as the Transformer architecture.

A typical large model begins with an embedding projection layer, followed by a stack of (e.g., 80) identical Transformer layers, and concludes with a normalization layer and a fully connected head that predicts token probabilities. During inference, these layers execute sequentially: the output of one layer is the input to the next, and only one layer is active at any given moment. Therefore, there is no need to keep all layers resident in GPU memory simultaneously.

We can load any required layer from disk just as it is about to be executed, perform all its computations, and then completely release its memory. With this approach, the GPU memory required at any point is roughly the parameter size of a single Transformer layer—about 1/80th of the full model, or approximately 1.6GB for a 70B model.
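To make the idea concrete, here is a minimal sketch of layer-by-layer execution in PyTorch. It is illustrative only, not AirLLM's actual code: the make_layer helper and the one-file-per-layer shard layout (with matching key names) are assumptions.

# Minimal sketch of layer-by-layer inference (illustrative, not AirLLM's internals).
# Assumes the checkpoint has been pre-sharded into one safetensors file per layer.
import gc
import torch
from safetensors.torch import load_file

@torch.no_grad()
def run_layers_sequentially(make_layer, num_layers, hidden_states, shard_dir):
    for i in range(num_layers):
        layer = make_layer(i)                                    # build this layer's module
        layer.load_state_dict(load_file(f"{shard_dir}/layer_{i}.safetensors"))
        layer.to("cuda")                                         # only ~1/80 of the model is resident
        hidden_states = layer(hidden_states)                     # this output feeds the next layer
        del layer                                                # release the layer's ~1.6GB
        gc.collect()
        torch.cuda.empty_cache()
    return hidden_states

The embedding layer, final normalization, and LM head would be handled the same way before and after the loop.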

Additionally, some output caches are stored in GPU memory, the largest of which is the Key-Value (KV) cache, used to avoid recomputation in the attention mechanism. For a 70B model (80 layers, 8 key-value heads under grouped-query attention, head dimension 128, fp16 weights and activations), the KV cache size is approximately:
2 * input_length * num_layers * num_kv_heads * head_dim * 2 bytes
For an input length of 100, this works out to 2 * 100 * 80 * 8 * 128 * 2 ≈ 31MB of GPU memory. According to our monitoring, the entire inference process uses less than 4GB of GPU VRAM!
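As a quick sanity check of that figure in Python:

# Back-of-the-envelope KV-cache size for a 100-token input (fp16, Llama 2 70B shapes)
input_length, num_layers, num_kv_heads, head_dim, bytes_per_elem = 100, 80, 8, 128, 2
kv_cache_bytes = 2 * input_length * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{kv_cache_bytes / 1024**2:.1f} MB")  # ~31.2 MB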

2. Single-Layer Optimization — Flash Attention

Flash Attention is arguably one of the most important optimizations in modern LLM development. Virtually all large language models rely on the same underlying attention computation, and Flash Attention is a major improvement to it.

The core idea behind Flash Attention optimization is not entirely novel; we must acknowledge the earlier paper "Self-attention Does Not Need O(n²) Memory." Originally, self-attention required O(n²) memory (where n is the sequence length). That paper demonstrated that we don't need to retain all O(n²) intermediate results. We can compute them sequentially, continuously updating a running intermediate result and discarding others, thereby reducing memory complexity to O(log n).

Flash Attention follows a similar principle, albeit with a slightly higher O(n) memory complexity. Its real power, however, lies in deeply optimizing CUDA memory access patterns, which yields multi-fold speedups for both inference and training over naive implementations. Whereas the original self-attention computes and stores O(n²) intermediate matrices, Flash Attention breaks the computation into many small tiles, processes them one at a time, and reduces the memory footprint to the size of a single tile.
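The sketch below illustrates only the tiling idea (blocked keys and values with a running softmax normalizer) in plain PyTorch; Flash Attention's real gains come from fusing this loop into CUDA kernels that keep each tile in on-chip SRAM. The function name and block size are illustrative.

import torch

def tiled_attention(q, k, v, block_size=128):
    # Single-head attention over key/value tiles; the full (n x n) score
    # matrix is never materialized. q, k, v have shape (n, d).
    n, d = q.shape
    scale = d ** -0.5
    m = torch.full((n, 1), float("-inf"))    # running row-wise max of the scores
    l = torch.zeros(n, 1)                    # running softmax denominator
    acc = torch.zeros(n, d)                  # running unnormalized output
    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]               # one tile of K
        v_blk = v[start:start + block_size]               # matching tile of V
        s = (q @ k_blk.T) * scale                         # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)                 # rescale earlier partial sums
        p = torch.exp(s - m_new)                          # this tile's softmax numerator
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l                                        # final normalization

Up to floating-point error, this returns the same result as the usual softmax(QKᵀ/√d)V, while only ever holding one tile of attention scores in memory.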

3. Model File Sharding

Original model files are often sharded into multiple chunks, typically around 10GB each. Our execution process is layer-by-layer. Each layer is only about 1.6GB. If we load based on the original 10GB shards, every layer execution would require reloading an entire 10GB file while only using 1.6GB. This process wastes significant memory bandwidth and, more critically, incurs excessive disk I/O.

Disk read speed is often the slowest bottleneck in the entire inference pipeline, so we aim to minimize it as much as possible. Therefore, we preprocess the original Hugging Face model files by sharding them layer by layer. For storage we use the safetensors library (https://github.com/huggingface/safetensors): its on-disk format closely matches the in-memory layout, and it loads via memory mapping for maximum speed.
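A minimal sketch of this preprocessing step, assuming the checkpoint is already loaded as a state dict; the "model.layers.<i>." key prefix follows the Hugging Face Llama naming convention, and the output file names are an illustrative choice rather than AirLLM's exact layout.

# Re-shard a checkpoint so each Transformer layer gets its own safetensors file.
from collections import defaultdict
from safetensors.torch import save_file

def shard_by_layer(state_dict, out_dir):
    per_layer = defaultdict(dict)
    for name, tensor in state_dict.items():
        if name.startswith("model.layers."):
            layer_idx = int(name.split(".")[2])             # "model.layers.42.xxx" -> 42
            per_layer[f"layer_{layer_idx}"][name] = tensor
        else:
            per_layer["other"][name] = tensor               # embeddings, final norm, lm_head
    for key, tensors in per_layer.items():
        save_file(tensors, f"{out_dir}/{key}.safetensors")  # mmap-friendly, one file per layer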

4. Meta Device

In our implementation, we leverage the meta device functionality provided by Hugging Face Accelerate (https://huggingface.co/docs/accelerate/usage_guides/big_modeling). The meta device is a virtual device specifically designed for running extremely large models. When you load a model via the meta device, the actual model data is not read into memory; only the code structure (the "skeleton") is loaded. The memory usage is effectively 0.

You can dynamically transfer parts of the model from the meta device to a real device (e.g., CPU or GPU) during execution. Only at that moment is the data truly loaded into memory. Using init_empty_weights() allows model loading via the meta device:

from accelerate import init_empty_weights
with init_empty_weights():
    my_model = ModelClass(...)
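The same Accelerate guide pairs init_empty_weights() with load_checkpoint_and_dispatch(), which fills the empty skeleton with real weights according to a device map; the checkpoint path below is a placeholder.

from accelerate import load_checkpoint_and_dispatch

# Materialize the skeleton's weights from a (placeholder) checkpoint,
# letting Accelerate decide which parts go to GPU and which are offloaded to CPU.
my_model = load_checkpoint_and_dispatch(
    my_model, checkpoint="/path/to/checkpoint", device_map="auto"
)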

Open-Source Library: AirLLM

We have open-sourced all the code as AirLLM, enabling you to achieve this with just a few lines of code. It is available on the Anima GitHub repository: https://github.com/lyogavin/Anima/tree/main/air_llm.

Usage is straightforward. First, install the package:

pip install airllm

Then, perform layered inference similar to a standard Transformer model:

from airllm import AirLLMLlama2

MAX_LENGTH = 128
# Use a Hugging Face model repo ID:
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

# Or use a local path to the model...
# model = AirLLMLlama2("/path/to/your/model")

input_text = [
    'What is the capital of the United States?',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)

We have tested this code on a 16GB Nvidia T4 GPU. The entire inference process uses less than 4GB of GPU memory. Note that inference speed on lower-end GPUs like the T4 can be quite slow. This approach is less suitable for interactive scenarios like chatbots but is ideal for offline data analysis tasks such as RAG (Retrieval-Augmented Generation) or PDF analysis. Currently, support is limited to models based on the Llama 2 architecture. If you need support for other model families, please leave a comment!

Can 70B Model Training Be Done on a Single GPU?

While inference can be optimized via layering, can training be similarly adapted to work on a single GPU? The key difference lies in the data requirements.

Inference only needs the output from the previous layer to execute the next Transformer layer, allowing for sequential, layer-by-layer execution with limited data persistence. Training, however, demands much more data retention. The training process involves a forward pass to compute the output of each layer and tensor, followed by a backward pass to calculate gradients for each parameter.

Gradient calculation requires saving the results from the forward pass of all preceding layers. Therefore, simply executing layers sequentially does not reduce memory because the intermediate activations needed for the backward pass must remain accessible. Other techniques, such as Gradient Checkpointing, are required to achieve similar memory reduction for training. If you are interested in how gradient checkpointing can significantly lower training memory requirements, please leave a comment!
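As a pointer for interested readers, PyTorch exposes gradient checkpointing directly through torch.utils.checkpoint: a checkpointed segment discards its internal activations after the forward pass and recomputes them during the backward pass, trading compute for memory. A minimal sketch (the layer stack is illustrative):

import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(layers, hidden_states):
    # Each layer's internal activations are freed after its forward pass and
    # recomputed on the fly when gradients for that layer are needed.
    for layer in layers:
        hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states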

Article Source: https://medium.com/ai-advances/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
