
AirLLM: An Open-Source Framework for Running 70B-Parameter LLMs on a 4GB GPU

2026/1/25
AI Summary (BLUF)

AirLLM is an open-source framework that enables running 70B-parameter large language models on a single 4GB GPU through layer-wise offloading and memory optimization techniques, democratizing access to cutting-edge AI without traditional compression methods.

In the ever-evolving landscape of artificial intelligence, one of the most significant challenges has been the resource demands of large language models (LLMs). Models with tens of billions of parameters, such as 70B-parameter LLMs, typically require high-end GPUs with substantial VRAM (often 24GB or more) to run efficiently. However, a groundbreaking project called AirLLM is turning heads by enabling these massive models to run on a modest 4GB GPU. This blog post explores what AirLLM is, how it works, its performance trade-offs, and how you can get started.


What is AirLLM?

AirLLM is an innovative open-source project designed to optimize the memory usage of large language models during inference. Developed by Gavin Li and hosted on GitHub, AirLLM allows users to run 70B-parameter LLMs, such as Llama 3.1, on a single 4GB GPU card without relying on traditional model compression techniques like quantization, distillation, or pruning. This is a game-changer for individuals and organizations with limited hardware resources, democratizing access to cutting-edge AI.


The project has garnered significant attention, boasting over 6.5k stars on GitHub, and was recently highlighted in a post by Md Ismail Šojal on X. The post showcased AirLLM’s capability to handle 70B models with layer-wise inference and optional quantization, sparking a lively discussion among AI enthusiasts.


How Does AirLLM Work?

AirLLM achieves this feat through a clever combination of memory optimization techniques and layer-wise offloading. Here’s a breakdown of the key mechanisms:


1. Layer-Wise Offloading

Instead of loading the entire 70B-parameter model (which would require ~130GB of memory in full precision) into the GPU at once, AirLLM splits the model into layers. These layers are offloaded to the CPU and RAM when not in use, with only the necessary layers loaded onto the 4GB GPU for inference. This approach minimizes VRAM usage while maintaining the model’s full precision, avoiding the accuracy trade-offs associated with quantization.
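To make the mechanism concrete, here is a minimal sketch of layer-wise offloading. It is not AirLLM's actual code: it assumes each transformer layer has been saved to disk with torch.save, and that hidden_states is already a tensor on the GPU. Only one layer occupies VRAM at any moment.

import torch

def layerwise_forward(layer_paths, hidden_states, device="cuda"):
    # Stream one transformer layer at a time through the small GPU.
    for path in layer_paths:                          # one saved shard per layer
        layer = torch.load(path, map_location="cpu")  # weights wait in CPU RAM / on disk
        layer = layer.to(device)                      # only this layer occupies VRAM
        with torch.no_grad():
            hidden_states = layer(hidden_states)      # activations stay on the GPU
        del layer                                     # free VRAM before loading the next layer
        torch.cuda.empty_cache()
    return hidden_states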


2. Memory Optimization

The project leverages advanced memory management strategies, such as prefetching (introduced in version 2.5), to overlap model loading and computation, improving efficiency by up to 10%. This ensures smooth inference even on low-end hardware.
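The prefetching idea can be illustrated with a hedged sketch (again, not AirLLM's implementation; the function and variable names are made up): while layer i computes on the GPU, a background thread reads layer i+1 from disk, so I/O overlaps with compute instead of serializing with it.

import threading
import torch

def prefetched_forward(layer_paths, hidden_states, device="cuda"):
    def load(path, slot):
        slot["layer"] = torch.load(path, map_location="cpu")   # disk -> CPU RAM

    slot = {}
    loader = threading.Thread(target=load, args=(layer_paths[0], slot))
    loader.start()
    for i in range(len(layer_paths)):
        loader.join()                             # wait for this layer to finish loading
        layer = slot["layer"].to(device)
        if i + 1 < len(layer_paths):              # immediately start fetching the next layer
            slot = {}
            loader = threading.Thread(target=load, args=(layer_paths[i + 1], slot))
            loader.start()
        with torch.no_grad():
            hidden_states = layer(hidden_states)  # this compute overlaps the background load
        del layer
        torch.cuda.empty_cache()
    return hidden_states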


3. Support for Larger Models

Recent updates have expanded AirLLM’s capabilities. As of version 2.11.0 (August 2024), it supports running the massive 405B-parameter Llama 3.1 model on an 8GB VRAM GPU, further pushing the boundaries of what’s possible with limited resources.
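Assuming access to the gated Llama 3.1 405B weights on Hugging Face (the repo id below is an assumption, and the downloaded layer shards need several hundred GB of free disk), loading it looks the same as for the 70B models:

from airllm import AutoModel
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")  # repo id assumed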


4. Optional Quantization

While AirLLM avoids quantization by default to preserve accuracy, it offers 4-bit and 8-bit block-wise quantization options (introduced in version 2.0) for a potential 3x speedup in inference with minimal accuracy loss. These options are detailed in the project's documentation, which links to the underlying 2022 research on block-wise quantization, though AirLLM's core layer-wise method predates that study.
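To see why block-wise quantization costs little accuracy, here is a toy NumPy illustration of the idea (concept only, not bitsandbytes' actual kernels): each small block of weights gets its own int8 scale, so an outlier in one block cannot distort the precision of the others.

import numpy as np

def blockwise_quantize(w, block=64):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per 64-weight block
    scale = np.maximum(scale, 1e-12)                        # guard against all-zero blocks
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def blockwise_dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = blockwise_quantize(w)
print("max abs error:", np.abs(blockwise_dequantize(q, s) - w).max())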


Performance and Trade-Offs

While AirLLM is impressive, it comes with trade-offs. Community feedback on X reveals that inference speed is a bottleneck, with estimates ranging from 0.7 tokens per second to as slow as one token per hour in extreme cases. This is largely due to the overhead of layer offloading and disk I/O. For instance, one user reported roughly 30 seconds per word, highlighting the need for patience or faster storage such as SSDs.


Despite the speed limitations, the ability to run unquantized 70B models on a 4GB GPU is a remarkable achievement. The project also supports a wide range of models, including Llama 3, Qwen2.5, ChatGLM, and Mistral, making it versatile for various use cases.


Getting Started with AirLLM

Ready to try AirLLM yourself? Here’s a step-by-step guide based on the official GitHub repository:


1. Installation

First, install the AirLLM package via pip:
pip install airllm
For quantization support, also install bitsandbytes:
pip install -U bitsandbytes


2. Inference Example

Initialize a model and run inference using the AutoModel class. Here’s an example for a 70B Llama model:

from airllm import AutoModel

MAX_LENGTH = 128  # maximum prompt length in tokens
# compression='4bit' turns on optional block-wise quantization; omit it for full precision
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct", compression='4bit')
input_text = ['What is the capital of the United States?']
input_tokens = model.tokenizer(input_text, return_tensors="pt", truncation=True,
                               max_length=MAX_LENGTH, padding=False)
# generate() streams the model's layers through the GPU one at a time
generation_output = model.generate(input_tokens['input_ids'].cuda(), max_new_tokens=20,
                                   use_cache=True, return_dict_in_generate=True)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)

This code loads the model layer-wise and generates text, with the option to enable compression for faster performance.


3. MacOS Support

AirLLM also works on MacOS with Apple Silicon, requiring mlx and torch. Check the example notebook.
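On Apple Silicon the extra dependencies can be installed with pip (exact version requirements may vary by release; the repo's example notebook is the authoritative reference):

pip install mlx torch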


4. Additional Configurations

  • Use compression='4bit' or '8bit' for quantized inference.
  • Set profiling_mode=True to monitor time consumption.
  • Specify layer_shards_saving_path for custom storage of split layers (all three options are combined in the sketch below).
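A minimal sketch combining the options above; the model id matches the earlier example, and the shard path is a placeholder you would point at fast local storage:

from airllm import AutoModel

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',                           # or '8bit'; omit for full-precision inference
    profiling_mode=True,                          # report where time is spent during inference
    layer_shards_saving_path="/path/to/fast/ssd"  # placeholder: where the split layers are cached
)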


Community Insights and Future Potential

The discussion around AirLLM highlights both excitement and practical considerations. Community members have pointed out potential bottlenecks like PCIe bandwidth and suggested optimizations. The project's roadmap, as seen in its changelog, shows continuous evolution with support for new models and features like CPU inference.


AirLLM's core implication is the democratization of large-scale AI. It enables experimentation, education, and small-scale deployment on consumer-grade hardware, lowering the barrier to entry. Future improvements in inference speed and integration with other high-performance frameworks could significantly broaden its applicability and impact.


Conclusion

AirLLM stands as a testament to open-source innovation, demonstrating that substantial LLMs can operate on limited hardware through intelligent system design. While current speed limitations require consideration for production use, it provides an invaluable tool for learning, prototyping, and accessing state-of-the-art models. Developers and researchers are encouraged to explore the GitHub repository, contribute to its development, and join the ongoing conversation about efficient inference.



