
NanoChat: Train Your Own ChatGPT-Level AI Model for About $100 in 4 Hours

2026/2/4
AI Summary (BLUF)

NanoChat is a complete LLM training framework developed by AI expert Andrej Karpathy: an end-to-end, minimal codebase that lets users train their own ChatGPT-level model for approximately $100 in about 4 hours.

🎯 What is NanoChat?

NanoChat is a complete LLM training framework developed by renowned AI expert Andrej Karpathy. It condenses the entire pipeline, from tokenization, pre-training, fine-tuning, and evaluation through to web deployment, into a clean, minimal, and hackable codebase.

The biggest highlight: for about $100 and 4 hours, you can train your own ChatGPT on a single 8XH100 node!

✅ Core Features

  • End-to-end: covers the complete LLM lifecycle
  • Concise code: ~8K lines of code across 45 files
  • Minimal dependencies: the uv.lock file is only 2,004 lines
  • Solid performance: reaches GPT-2-level capability
  • Controllable cost: starts at $100, with about $1,000 for a high-performance run

🎓 Learning Value

  • Understand the complete LLM training pipeline
  • Master the principles of tokenization
  • Learn pre-training and fine-tuning techniques
  • Understand reinforcement learning from human feedback (RLHF)
  • Practice model evaluation methods

📊 Project Scale

  • ~333K characters
  • ~8.3K lines of code
  • 45 files
  • ~83K tokens
  • 29K GitHub stars

🚀 Quick Start Guide

1. Environment Setup

Hardware requirements:

  • Recommended: 8XH100 80GB node (about $100 to rent from Lambda Labs for the full run)
  • Compatible: 8XA100 80GB (slightly slower but usable)
  • Minimum: a single GPU (adjust batch_size accordingly; training takes roughly 8x longer)
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Install dependencies (uv is a fast, modern package manager)
pip install uv
uv sync

# Or use traditional pip
pip install -e .

2. Run the Full Pipeline with One Command

NanoChat provides a speedrun.sh script that completes the entire process, from data preparation to model deployment, with a single command (about 4 hours).

# Run the complete training pipeline (speedrun mode, ~$100 budget)
bash speedrun.sh

# The pipeline includes:
# 1. Download training data (FineWeb)
# 2. Pre-training (base model)
# 3. Mid training
# 4. Supervised fine-tuning (SFT)
# 5. Reinforcement learning (RL/RLHF)
# 6. Model evaluation
# 7. Start the web service

Expected results:

  • ✓ Total training time: about 3 hours 51 minutes
  • ✓ Total cost: about $100 (Lambda Labs 8XH100)
  • ✓ Model capability: a ~4e19-FLOP training run (roughly kindergartener level 😊)
  • ✓ Automatically generates a report.md evaluation report

3. Chat with Your LLM

After training completes, start the web interface to interact with the model.

# Start the web service
python -m scripts.chat_web

# Example output:
# * Running on http://0.0.0.0:8000
# Visit: http://your-server-ip:8000

💬 Test prompts:

  • "Tell me a story" - test creative generation
  • "Who are you?" - observe model hallucinations
  • "Why is the sky blue?" - test knowledge Q&A
  • "Write a poem" - test literary creation

4. View the Evaluation Report

# View the complete evaluation report
cat report.md

# The report includes:
# - Training metrics for each stage
# - Evaluation results on multiple benchmarks
# - Training duration and cost statistics

Sample evaluation table:

Metric     BASE     MID      SFT      RL
CORE       0.2219   -        -        -
GSM8K      -        0.0250   0.0455   0.0758
HumanEval  -        0.0671   0.0854   -

🏗️ Architecture Design

Complete Training Pipeline

  1. Data Preparation

    Download the FineWeb dataset → process it with the RustBPE tokenizer

  2. Pre-training

    Base model training → learn language fundamentals (grammar, vocabulary, common sense)

  3. Mid Training

    Continue training on high-quality data

  4. Supervised Fine-Tuning

    SFT → learn the conversational format (SmolTalk dataset)

  5. Reinforcement Learning

    RLHF → optimize answer quality through human feedback

  6. Evaluation & Deployment

    Multi-dimensional evaluation → web service deployment → real-world use (a rough orchestration sketch follows below)
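
To make the stage ordering concrete, here is a minimal Python sketch of roughly what speedrun.sh orchestrates. The script names come from the project structure below, but treat the exact commands as assumptions for illustration, not the project's actual CLI:

# pipeline_sketch.py - illustrative only; speedrun.sh is the real entry point
import subprocess

# Training stages in pipeline order (data download and evaluation omitted).
# The scripts/*.py module names exist in the repo; flags are placeholders.
stages = [
    ["torchrun", "--standalone", "--nproc_per_node=8", "-m", "scripts.base_train"],
    ["torchrun", "--standalone", "--nproc_per_node=8", "-m", "scripts.mid_train"],
    ["torchrun", "--standalone", "--nproc_per_node=8", "-m", "scripts.sft_train"],
    ["torchrun", "--standalone", "--nproc_per_node=8", "-m", "scripts.rl_train"],
]

for cmd in stages:
    subprocess.run(cmd, check=True)  # abort the pipeline if a stage fails

# Finally, serve the chat UI (as in the Quick Start section)
subprocess.run(["python", "-m", "scripts.chat_web"], check=True)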

📁 Project Structure

nanochat/
├── nanochat/           # Core code
│   ├── model.py       # Transformer model
│   ├── dataset.py     # Data loading
│   └── utils.py       # Utility functions
├── rustbpe/           # BPE tokenizer implemented in Rust
├── scripts/           # Training scripts
│   ├── base_train.py  # Pre-training
│   ├── mid_train.py   # Mid training
│   ├── sft_train.py   # Supervised fine-tuning
│   ├── rl_train.py    # Reinforcement learning
│   └── chat_web.py    # Web service
├── tasks/             # Evaluation tasks
├── speedrun.sh        # One-command run script
└── pyproject.toml     # Dependency configuration

🧩 Core Components

  • 🔤 RustBPE

    High-performance BPE tokenizer written in Rust: fast, with a low memory footprint

  • 🧠 Transformer

    Standard decoder-only architecture with configurable depth

  • 📊 Evaluation System

    Supports multiple benchmarks, including CORE, GSM8K, and HumanEval

  • 🌐 Web Interface

    Clean HTML+CSS chat interface

🔍 Core Code Walkthrough

Core 1: Transformer Model Implementation

A standard decoder-only Transformer architecture, implemented in minimal, efficient code.

# nanochat/model.py (simplified)
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """Causal self-attention layer"""
    def __init__(self, config):
        super().__init__()
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        # Q, K, V projections (fused into one linear layer)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # No explicit causal-mask buffer is needed here: masking is handled
        # by is_causal=True in scaled_dot_product_attention below

    def forward(self, x):
        B, T, C = x.size()  # batch, seq_len, embedding_dim

        # Compute Q, K, V
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Split into multiple heads: (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # Attention scores (uses Flash Attention kernels when available)
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Merge the heads back together
        y = att.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Transformer block"""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        )

    def forward(self, x):
        # Pre-norm architecture
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

Key design choices:

  • Pre-norm: LayerNorm sits before the attention/MLP sublayers, which makes training more stable
  • Flash Attention: uses PyTorch 2.0's optimized scaled_dot_product_attention, roughly 2-3x faster
  • Causal masking: each position attends only to earlier tokens, enabling autoregressive generation
  • Configurable depth: adjust the model size via the --depth parameter (see the usage sketch below)
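
For intuition, here is a minimal usage sketch of the Block defined above. The Config dataclass is hypothetical glue added for illustration (the real project defines its own configuration); it contains only the fields the snippet actually reads:

# Hypothetical minimal config for the simplified code above; the field
# names (n_embd, n_head) mirror what the snippet reads, nothing more
from dataclasses import dataclass
import torch

@dataclass
class Config:
    n_embd: int = 256  # embedding dimension
    n_head: int = 4    # number of attention heads (must divide n_embd)

config = Config()
block = Block(config)  # Block as defined in the snippet above

x = torch.randn(2, 16, config.n_embd)  # (batch=2, seq_len=16, n_embd)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 256]) - the block preserves the shape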

Core 2: Data Loading and Processing

# nanochat/dataset.py (simplified)
from pathlib import Path
import numpy as np
import torch

class DataLoader:
    def __init__(self, split, batch_size, seq_len, device):
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.device = device

        # Load the data shards for this split
        self.shards = self._load_shards(split)
        self.reset()

    def _load_shards(self, split):
        """Load the list of preprocessed data files from disk"""
        data_dir = Path("data") / split
        shards = sorted(data_dir.glob("*.npy"))
        return shards

    def reset(self):
        """Reset the iterator to the first shard"""
        self.current_shard = 0
        self.current_position = 0
        self.tokens = np.load(self.shards[self.current_shard])

    def _load_next_shard(self):
        """Advance to the next shard (wrapping around at the end)"""
        self.current_shard = (self.current_shard + 1) % len(self.shards)
        self.current_position = 0
        self.tokens = np.load(self.shards[self.current_shard])

    def next_batch(self):
        """Get the next (input, target) batch"""
        B, T = self.batch_size, self.seq_len
        buf = self.tokens[self.current_position : self.current_position + B*T + 1]

        # X: input sequence; Y: target sequence (shifted right by one token)
        x = torch.tensor(buf[:-1], dtype=torch.long).view(B, T)
        y = torch.tensor(buf[1:], dtype=torch.long).view(B, T)

        # Move to the GPU
        x, y = x.to(self.device), y.to(self.device)

        # Advance the read position; move on when the shard is exhausted
        self.current_position += B * T
        if self.current_position + B*T + 1 > len(self.tokens):
            self._load_next_shard()

        return x, y

Optimization highlights:

  • Sharding: the data is split across multiple files, so it never has to be loaded into memory all at once
  • Pre-tokenization: tokenization happens before training, avoiding repeated computation during the run (sketched below)
  • numpy storage: the .npy format loads quickly
  • Sequential sampling: reading in order improves cache hit rates
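
As an illustration of the sharding and pre-tokenization steps, here is a sketch that writes a token stream into the data/<split>/*.npy layout that _load_shards above expects. The shard size and file naming are assumptions made for this example, not the project's actual data script:

# pretokenize_sketch.py - illustrative; not NanoChat's actual data pipeline
from pathlib import Path
import numpy as np

SHARD_TOKENS = 1_000_000  # tokens per shard (an arbitrary choice here)

def write_shards(token_stream, split):
    """Write a flat token stream into data/<split>/NNN.npy shards,
    the layout that DataLoader._load_shards above reads back."""
    out_dir = Path("data") / split
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(token_stream), SHARD_TOKENS):
        shard = np.asarray(token_stream[i : i + SHARD_TOKENS], dtype=np.uint16)
        np.save(out_dir / f"{i // SHARD_TOKENS:03d}.npy", shard)

# In a real run the tokens would come from the tokenizer, e.g.
# token_stream = tokenizer.encode(corpus_text)
token_stream = list(range(65_000)) * 40  # stand-in data (ids fit in uint16)
write_shards(token_stream, "train")      # -> data/train/000.npy, 001.npy, ...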

Core 3: RustBPE Tokenizer

A BPE (Byte Pair Encoding) tokenizer implemented in Rust, offering excellent performance.

Why Rust?

  • Speed: 10-100x faster than Python
  • Memory safety: compile-time checks rule out whole classes of memory bugs
  • Parallelism: easily takes advantage of multi-core CPUs
  • Easy deployment: compiles to a binary with no runtime dependencies

How BPE Works

  1. Start at the character (byte) level
  2. Count the frequency of adjacent symbol pairs
  3. Merge the most frequent pair into a new token
  4. Repeat until the target vocabulary size is reached (see the toy sketch below)
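
A toy pure-Python sketch of this merge loop, for intuition only (the real implementation lives in rustbpe/ and operates on raw bytes):

# bpe_sketch.py - toy illustration of the BPE merge loop, not RustBPE itself
from collections import Counter

def train_bpe(words, num_merges):
    """words: a list of symbol tuples, e.g. [('h','e','l','l','o'), ...]"""
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs across all words
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new symbol
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(w[i] + w[i + 1])  # concatenate the pair
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(tuple(out))
        words = merged_words
    # Step 4: stop once the merge budget (vocab size) is used up
    return merges

print(train_bpe([tuple("hello"), tuple("help")], num_merges=3))
# -> [('h', 'e'), ('he', 'l'), ('hel', 'l')]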
# Calling the Rust tokenizer from Python
from nanochat.rustbpe import RustBPE

# Load or train the tokenizer
tokenizer = RustBPE(vocab_size=50257)  # GPT-2 vocabulary size
tokenizer.train(texts, verbose=True)

# Encode
text = "Hello, world!"
tokens = tokenizer.encode(text)  # [15496, 11, 995, 0]

# Decode
decoded = tokenizer.decode(tokens)  # "Hello, world!"

⚙️ Detailed Training Pipeline

Stage 1: Pre-training (Base Model)

Goal: learn the foundational knowledge and patterns of language.

Training Configuration

# scripts/base_train.py
# --depth: model depth; --device_batch_size: batch size per GPU
torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train \
  --depth=14 \
  --device_batch_size=32