Running LLaMA2-13B on iOS Devices: A Complete Technical Guide Based on Apple's MLX Framework
This article provides a comprehensive technical analysis of running LLaMA2-13B on iOS devices using Apple's MLX framework, covering environment setup, model architecture, code implementation, parameter analysis, and computational requirements.
A Deep Dive into LLaMA 2-13B: Architecture, Implementation, and Deployment on Apple Silicon
This article aims to provide a comprehensive and in-depth technical analysis of the Meta LLaMA 2-13B model. We combine its theoretical architecture with the concrete code implementation in the Apple MLX framework, covering the complete process from environment setup and weight acquisition to core module implementation and performance analysis. By the end, you should have a thorough understanding of the model's key details.
Article Structure
This article is organized into the following sections:
- Environment Setup: setting up the runtime environment based on Conda and Apple MLX.
- LLaMA 2 Application and Download: acquiring the model weight files.
- LLaMA 2 Structural Analysis: walking through the model's overall Transformer architecture.
- LLaMA 2 Code Analysis: an in-depth look at the key modules of the MLX implementation.
- LLaMA 2 Parameter and Computational Analysis: estimating the model's parameter count and compute requirements.
Environment Setup
Conda Environment Preparation
First, we create an isolated Python environment.
conda create -n mlxtest python=3.10 -y
conda activate mlxtest
MLX Installation
MLX is Apple's array computation framework optimized for Apple Silicon. We clone and install it from the official repository.
git clone https://github.com/ml-explore/mlx.git
cd mlx
pip install -e .
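Once the install finishes, a quick sanity check confirms that MLX imports and runs a trivial computation on the default device (on Apple Silicon this is typically the GPU); the exact device string may differ by machine:

import mlx.core as mx
print(mx.default_device())         # typically Device(gpu, 0) on Apple Silicon
print(mx.array([1.0, 2.0]) + 1)    # trivial computation to verify the install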
The LLaMA example code we will primarily analyze lives in the llms/llama directory of the separate ml-explore/mlx-examples repository.
LLaMA 2 Code Download (Optional)
For the weight download step you may also need the official LLaMA repository (primarily for its download script).
git clone https://github.com/facebookresearch/llama.git
LLaMA 2 Weight Application and Download
Requesting the LLaMA 2 weights from the Meta AI website is now very convenient.
- Visit the official download page: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
- Submit the request with a real email address to receive the download link. The link is typically valid for 24 hours.
After receiving the email, go into the official llama repository directory and run the download script.
cd llama
bash download.sh
When the script prompts you, enter the PRESIGNED_URL from the email and select the model size (e.g., 13B). You can also set the corresponding variables directly in download.sh to automate this step.
LLaMA 2 Weight Conversion and Execution
The downloaded PyTorch-format weights need to be converted to MLX format.
# Change into the llama example directory of the mlx-examples repository
cd mlx-examples/llms/llama
# Perform the weight conversion; --torch-path points to the downloaded PyTorch weight directory
python convert.py --torch-path ~/path/to/your/llama-2-13b -q
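The -q flag quantizes the weights during conversion (4-bit with group size 64, matching the quantization block in the config.json shown later), which is what makes a 13B model fit comfortably in consumer memory. A rough size estimate, ignoring the small overhead of per-group scales and biases:

n_params = 13_000_000_000                # ~13B weights
bits_per_weight = 4                      # "bits": 4 in the quantization config
approx_bytes = n_params * bits_per_weight / 8
print(f"{approx_bytes / 1e9:.1f} GB")    # ~6.5 GB, versus ~26 GB for float16 weights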
Once the conversion completes, you can run the model with MLX.
python llama.py --prompt "Hello, world!"
At this point, you have a working LLaMA 2-13B deployment running on your Mac.
LLaMA 2 Structural Analysis
LLaMA 2 follows the classic decoder-only Transformer architecture. Its core computational flow can be summarized as follows:
- Input Embedding: maps input tokens to high-dimensional vectors.
- A stack of Transformer blocks, each containing:
  - Attention: computes softmax(Q @ K^T / sqrt(d)) @ V, where Q, K, and V are produced from the input by linear projections.
  - Feed-Forward Network (FFN): applies a non-linear transformation to the attention output.
  - Each sublayer (attention, FFN) is preceded by layer normalization (RMSNorm) and followed by a residual connection.
- Output Layer: maps the final hidden state back to vocabulary space to produce the probability distribution over the next token.
This architecture is clearly visible in the weight structure of the MLX implementation:
# Transformer Block (40 layers total) example
TransformerBlock(
  (attention): Attention(
    (wq): QuantizedLinear(input_dims=5120, output_dims=5120, ...)  # Query projection
    (wk): QuantizedLinear(input_dims=5120, output_dims=5120, ...)  # Key projection
    (wv): QuantizedLinear(input_dims=5120, output_dims=5120, ...)  # Value projection
    (wo): QuantizedLinear(input_dims=5120, output_dims=5120, ...)  # Output projection
    (rope): RoPE(128, traditional=True)  # Rotary positional encoding
  )
  (feed_forward): FeedForward(
    (w1): QuantizedLinear(input_dims=5120, output_dims=13824, ...)
    (w2): QuantizedLinear(input_dims=13824, output_dims=5120, ...)
    (w3): QuantizedLinear(input_dims=5120, output_dims=13824, ...)
  )
  (attention_norm): RMSNorm()
  (ffn_norm): RMSNorm()
)

# Other key weights
(tok_embeddings): Embedding(32000, 5120)  # Token embedding layer
(norm): RMSNorm()  # Final layer normalization
(output): QuantizedLinear(input_dims=5120, output_dims=32000, ...)  # Output layer
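The printed shapes are enough to tally the parameter count and confirm the "13B" label. A quick worked check (ignoring the small RMSNorm vectors; RoPE has no learned parameters):

dim, vocab, n_layers, ffn_dim = 5120, 32000, 40, 13824

attn_per_layer = 4 * dim * dim              # wq, wk, wv, wo
ffn_per_layer = 3 * dim * ffn_dim           # w1, w2, w3
per_layer = attn_per_layer + ffn_per_layer  # ~317M per block

total = n_layers * per_layer + 2 * vocab * dim   # plus tok_embeddings and output
print(f"{total / 1e9:.2f}B parameters")          # ~13.02B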
LLaMA 2 Code Analysis
Model Configuration
The model's core hyperparameters are defined in the ModelArgs dataclass and in the configuration file.
Configuration file (config.json):
{
  "dim": 5120,           // Hidden dimension
  "n_heads": 40,         // Number of attention heads
  "n_layers": 40,        // Number of Transformer layers
  "norm_eps": 1e-05,     // Normalization epsilon
  "vocab_size": -1,      // Vocabulary size (filled in from the weights)
  "model_type": "llama",
  "quantization": {      // Quantization configuration
    "group_size": 64,
    "bits": 4
  }
}
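For orientation, here is a minimal sketch of how such a file could be mapped onto the ModelArgs dataclass. The field names follow the config.json above; the loader function and its path handling are illustrative assumptions, not the exact mlx-examples code.

import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelArgs:
    dim: int
    n_heads: int
    n_layers: int
    norm_eps: float
    vocab_size: int
    model_type: str = "llama"
    quantization: dict | None = None

def load_args(model_dir: str) -> ModelArgs:
    # Read config.json from the converted model directory (hypothetical helper)
    config = json.loads((Path(model_dir) / "config.json").read_text())
    known = ModelArgs.__dataclass_fields__
    return ModelArgs(**{k: v for k, v in config.items() if k in known})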
Core Module Implementation
1. Embedding Layer
The embedding layer is a simple lookup table that maps integer token IDs to vectors of dimension dim.
self.tok_embeddings = nn.Embedding(args.vocab_size, args.dim)
# Actual weight shape: [32000, 5120]
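A tiny self-contained illustration of the lookup behaviour, using toy sizes rather than the real 32000 x 5120 table:

import mlx.core as mx
import mlx.nn as nn

emb = nn.Embedding(10, 4)           # toy vocabulary of 10 tokens, 4-dimensional vectors
token_ids = mx.array([[1, 5, 3]])   # [batch=1, seq_len=3]
vectors = emb(token_ids)            # each ID is replaced by its row of the weight table
print(vectors.shape)                # (1, 3, 4)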
2. Attention Mechanism
The Attention module implements multi-head attention and integrates Rotary Positional Encoding (RoPE) and a KV cache.
Key steps:
- Linear projection: generate Q, K, V via wq, wk, wv.
- Reshape and transpose: reshape Q, K, V into the multi-head layout [Batch, Num_Heads, Sequence_Length, Head_Dim].
- Apply RoPE: apply rotary positional encoding to Q and K.
- KV caching: during autoregressive generation, concatenate the current step's K and V with the cached K and V so that history is reused rather than recomputed.
- Attention computation: compute softmax((Q @ K^T) / sqrt(d_head)) @ V.
- Output projection: merge the per-head outputs through the wo linear layer.
KV cache implementation snippet:
if cache is not None:
    key_cache, value_cache = cache
    queries = self.rope(queries, offset=key_cache.shape[2])
    keys = self.rope(keys, offset=key_cache.shape[2])
    keys = mx.concatenate([key_cache, keys], axis=2)       # concatenate along the sequence dimension
    values = mx.concatenate([value_cache, values], axis=2)
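For context, here is a hedged sketch of how the method body typically continues after the cache update: scaled dot-product scores, the additive mask, softmax, and the output projection. Variable names follow the snippet above; the exact mlx-examples code may differ in minor details such as dtype handling.

# ... continuation of the same method body; queries/keys/values: [B, n_heads, L, head_dim]
import math  # would normally sit at the top of the file

scale = 1.0 / math.sqrt(queries.shape[-1])
scores = (queries * scale) @ keys.transpose(0, 1, 3, 2)   # [B, n_heads, L_q, L_kv]
if mask is not None:
    scores = scores + mask                                # additive causal mask
scores = mx.softmax(scores, axis=-1)
out = scores @ values                                     # weighted sum of value vectors
out = out.transpose(0, 2, 1, 3).reshape(queries.shape[0], queries.shape[2], -1)
return self.wo(out), (keys, values)                       # project back; return the updated cache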
3. Feed-Forward Network
LLaMA uses a SwiGLU-based feed-forward network, a gated variant of the standard two-layer FFN.
def __call__(self, x) -> mx.array:
    return self.w2(nn.silu(self.w1(x)) * self.w3(x))  # SwiGLU gate: silu(x @ W1) * (x @ W3), projected back by W2
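The intermediate width of 13824 seen in the weight dump is not arbitrary. Assuming the usual LLaMA sizing rule (two thirds of 4*dim, rounded up to a multiple of 256), the number can be reproduced as follows:

dim, multiple_of = 5120, 256
hidden = int(2 * (4 * dim) / 3)                                      # 13653
hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)   # round up -> 13824
print(hidden)                                                        # matches the w1/w3 output_dims above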
4. RMSNorm
LLaMA uses RMSNorm instead of the traditional LayerNorm; it is cheaper to compute because it only rescales by the root mean square, with no mean subtraction or bias term.
def _norm(self, x):
    # Normalize by the reciprocal of the root mean square (eps added for numerical stability)
    return x * mx.rsqrt(x.square().mean(-1, keepdims=True) + self.eps)
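The _norm helper above only performs the normalization; in the full module the result is additionally multiplied by a learned per-channel weight. A minimal sketch of the complete module, assumed to follow the usual MLX/LLaMA pattern:

import mlx.core as mx
import mlx.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dims: int, eps: float = 1e-5):
        super().__init__()
        self.weight = mx.ones((dims,))   # learned scale, initialized to ones
        self.eps = eps

    def _norm(self, x):
        return x * mx.rsqrt(x.square().mean(-1, keepdims=True) + self.eps)

    def __call__(self, x):
        # Normalize in float32 for stability, cast back, then apply the learned scale
        return self.weight * self._norm(x.astype(mx.float32)).astype(x.dtype)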
5. Causal Mask
Autoregressive generation requires a causal mask so that each position can only attend to itself and earlier positions.
@staticmethod
def create_additive_causal_mask(N: int, dtype: mx.Dtype = mx.float32):
    indices = mx.arange(N)
    # Boolean matrix that is True strictly above the diagonal,
    # i.e. at the future positions that must be masked out
    mask = indices[:, None] < indices[None]
    # Turn masked positions into large negative values added to the attention scores
    mask = mask.astype(dtype) * -1e9
    return mask
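A quick standalone check of what this produces for a length-4 sequence (the mask-building logic is copied here so the demo runs on its own):

import mlx.core as mx

def causal_mask_demo(N: int):
    indices = mx.arange(N)
    return (indices[:, None] < indices[None]).astype(mx.float32) * -1e9

print(causal_mask_demo(4))
# Row i is 0 at columns j <= i and -1e9 at future columns j > i, e.g.
# row 0 -> [0, -1e9, -1e9, -1e9], row 3 -> [0, 0, 0, 0]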
Generation Process
The Llama.generate() method implements autoregressive text generation.
- Prompt processing: run the input prompt through the model in a single forward pass and initialize the KV cache of every layer.
- Generation loop:
  - Feed the most recently generated token as the new input.
  - Run a forward pass that reuses the updated KV cache, so historical K and V are not recomputed.
  - Sample the next token from the output logits; the temperature temp controls randomness.
  - With temp = 0, greedy decoding (argmax) is used; with temp > 0, the token is drawn by multinomial sampling (see the sketch below).