NanoChat: Train Your Own ChatGPT-Style AI Model for About $100 in 4 Hours
NanoChat is a comprehensive LLM training framework developed by AI expert Andrej Karpathy. Through an end-to-end, minimalist codebase, it lets users train their own ChatGPT-style model for roughly $100 in about 4 hours.
🎯 What is NanoChat?
NanoChat is a complete LLM training framework developed by renowned AI expert Andrej Karpathy. It condenses the entire pipeline, from tokenization, pre-training, fine-tuning, and evaluation to web deployment, into a clean, minimal, and hackable codebase.
The biggest highlight: for just about $100 and 4 hours, you can train your own ChatGPT on a single 8XH100 node!
✅ Core Features
- End-to-end: covers the complete LLM lifecycle
- Concise code: ~8K lines of code across 45 files
- Minimal dependencies: the uv.lock lockfile is only 2,004 lines
- Solid performance: reaches roughly GPT-2 level
- Controllable cost: starts at $100, with higher performance at around $1,000
🎓 Learning Value
- Understand the complete LLM training pipeline
- Master the principles of tokenization
- Learn pre-training and fine-tuning techniques
- Understand RLHF (reinforcement learning from human feedback)
- Practice model evaluation methods
📊 Project Scale
- ~333K characters
- ~8.3K lines of code
- 45 files
- ~83K tokens
- 29K GitHub stars
🚀 Quick Start Guide
1. Environment Setup
Hardware requirements:
- Recommended: 8XH100 80GB GPUs (about $100 to rent on Lambda Labs for the full run)
- Compatible: 8XA100 80GB (slightly slower but usable)
- Minimum: a single GPU (adjust batch_size; expect roughly 8x the training time; see the sketch after this list)
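The single-GPU path works because a smaller per-device batch can be traded for more gradient-accumulation steps. The arithmetic below is a minimal sketch of that idea; the variable names and numbers are illustrative, not nanochat's actual flags or defaults.

```python
# Illustrative only: how a target total batch size can be preserved on fewer GPUs
# by accumulating gradients over more micro-steps (names and values are hypothetical).
total_batch_size = 524288          # target tokens per optimizer step
device_batch_size = 32             # sequences per GPU per micro-step
seq_len = 2048                     # tokens per sequence
world_size = 1                     # 1 GPU instead of 8

tokens_per_micro_step = device_batch_size * seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(grad_accum_steps)            # 8x more micro-steps than on 8 GPUs -> ~8x wall-clock time
```

With the hardware in place, clone the repository and install the dependencies: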
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Install dependencies (using the fast, modern uv package manager)
pip install uv
uv sync
# Or use traditional pip
pip install -e .
2. Run the Full Pipeline with One Command
NanoChat provides a speedrun.sh script that runs the entire process, from data preparation to model deployment, with a single command (about 4 hours).
# Run the complete training pipeline (speedrun mode, ~$100 budget)
bash speedrun.sh
# The pipeline includes:
# 1. Download the training data (FineWeb)
# 2. Pre-training (base model)
# 3. Mid training
# 4. Supervised fine-tuning (SFT)
# 5. Reinforcement learning (RL/RLHF)
# 6. Model evaluation
# 7. Start the web service
Expected results:
- ✓ Total training time: about 3 hours 51 minutes
- ✓ Total cost: about $100 (Lambda Labs 8XH100)
- ✓ Model capability: ~4e19 FLOPs of training compute (roughly kindergarten level 😊)
- ✓ An evaluation report is automatically generated as report.md
3. Chat with Your LLM
After training completes, start the web interface to interact with the model.
# Start the web service
python -m scripts.chat_web
# Example output:
# * Running on http://0.0.0.0:8000
# Visit: http://your-server-ip:8000
💬 Sample prompts to try:
- "讲个故事吧" - 测试创意生成 ("Tell me a story" - Test creative generation)
- "你是谁?" - 观察模型幻觉 ("Who are you?" - Observe model hallucinations)
- "为什么天空是蓝色的?" - 测试知识问答 ("Why is the sky blue?" - Test knowledge Q&A)
- "写一首诗" - 测试文学创作 ("Write a poem" - Test literary creation)
4. Review the Evaluation Report
# View the complete evaluation report
cat report.md
# The report includes:
# - Training metrics for each stage
# - Evaluation results on multiple benchmarks
# - Training duration and cost statistics
Example evaluation table:
| Metric | BASE | MID | SFT | RL |
|---|---|---|---|---|
| CORE | 0.2219 | - | - | - |
| GSM8K | - | 0.0250 | 0.0455 | 0.0758 |
| HumanEval | - | 0.0671 | 0.0854 | - |
🏗️ Architecture
Full Training Pipeline
- Data preparation: download the FineWeb dataset → process it with the RustBPE tokenizer
- Pre-training: train the base model → learn language fundamentals (grammar, vocabulary, common sense)
- Mid training: continue training on high-quality data
- Supervised fine-tuning: SFT → learn the conversational format (SmolTalk dataset)
- Reinforcement learning: RLHF → optimize answer quality using human feedback
- Evaluation & deployment: multi-benchmark evaluation → web service deployment → real use
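Conceptually, each stage maps onto one of the scripts under scripts/ (see the project structure below). The sketch that follows is a hedged illustration of how those stages could be chained from Python; the script names are taken from this article's project structure, the arguments are examples only, and in practice speedrun.sh is the real driver.

```python
# Illustrative orchestration of the pipeline stages; speedrun.sh drives the real sequence.
# Module names follow the project structure below; arguments are examples only.
import subprocess

def run_stage(module, *args):
    """Launch one training stage with torchrun on all 8 GPUs."""
    cmd = ["torchrun", "--standalone", "--nproc_per_node=8", "-m", module, *args]
    subprocess.run(cmd, check=True)

run_stage("scripts.base_train", "--depth=14")   # pre-training (base model)
run_stage("scripts.mid_train")                  # mid training on high-quality data
run_stage("scripts.sft_train")                  # supervised fine-tuning (SmolTalk)
run_stage("scripts.rl_train")                   # reinforcement learning (RL/RLHF)
subprocess.run(["python", "-m", "scripts.chat_web"], check=True)  # serve the model
```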
📁 Project Structure
nanochat/
├── nanochat/              # Core code
│   ├── model.py           # Transformer model
│   ├── dataset.py         # Data loading
│   └── utils.py           # Utility functions
├── rustbpe/               # BPE tokenizer implemented in Rust
├── scripts/               # Training scripts
│   ├── base_train.py      # Pre-training
│   ├── mid_train.py       # Mid training
│   ├── sft_train.py       # Supervised fine-tuning
│   ├── rl_train.py        # Reinforcement learning
│   └── chat_web.py        # Web service
├── tasks/                 # Evaluation tasks
├── speedrun.sh            # One-command run script
└── pyproject.toml         # Dependency configuration
🧩 Core Components
- 🔤 RustBPE: high-performance Rust tokenizer; fast, with low memory usage
- 🧠 Transformer: standard decoder-only architecture with configurable depth
- 📊 Evaluation system: quantifies model performance on multiple benchmarks, including CORE, GSM8K, and HumanEval
- 🌐 Web interface: clean HTML + CSS chat interface
🔍 Core Code Walkthrough
Core 1: Transformer Model Implementation
A standard decoder-only Transformer architecture, implemented in minimal, efficient code.
# nanochat/model.py (simplified)
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """Causal self-attention layer"""
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # Q, K, V projections
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # Causal mask (kept for reference; unused below because
        # scaled_dot_product_attention applies causal masking itself)
        self.register_buffer("bias", torch.tril(
            torch.ones(config.block_size, config.block_size)
        ).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch, seq_len, embedding_dim
        # Compute Q, K, V
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # Multi-head attention
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Attention scores (Flash Attention optimized kernel)
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge the heads back together
        y = att.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Transformer block"""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        )

    def forward(self, x):
        # Pre-norm architecture
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
Key design choices:
- Pre-norm: LayerNorm is applied before attention/MLP, which makes training more stable
- Flash Attention: PyTorch 2.0's optimized scaled_dot_product_attention kernel, roughly 2-3x faster
- Causal mask: each token can only attend to earlier tokens, enabling autoregressive generation
- Configurable depth: adjust model size via the --depth parameter
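For completeness, here is a minimal sketch of how such blocks are typically assembled into a full decoder-only model: token embeddings, a stack of Blocks, a final LayerNorm, and a language-model head. It continues from the imports and Block class above, but it is a generic composition, not nanochat's actual model code; GPTConfig and its field values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Illustrative hyperparameters, not nanochat's real defaults
    block_size: int = 1024   # max sequence length
    vocab_size: int = 50257
    n_layer: int = 12        # model depth
    n_head: int = 12
    n_embd: int = 768

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # position embeddings
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)      # (B, T, n_embd)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)               # (B, T, vocab_size)
        loss = None
        if targets is not None:
            # Next-token prediction: cross-entropy over the vocabulary
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```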
Core 2: Data Loading and Processing
# nanochat/dataset.py (simplified)
import numpy as np
import torch
from pathlib import Path

class DataLoader:
    def __init__(self, split, batch_size, seq_len, device):
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.device = device
        # Load data shards
        self.shards = self._load_shards(split)
        self.reset()

    def _load_shards(self, split):
        """Load preprocessed data files from disk"""
        data_dir = Path("data") / split
        shards = sorted(data_dir.glob("*.npy"))
        return shards

    def reset(self):
        """Reset the iterator"""
        self.current_shard = 0
        self.current_position = 0
        self.tokens = np.load(self.shards[self.current_shard])

    def _load_next_shard(self):
        """Advance to the next shard, wrapping around at the end"""
        self.current_shard = (self.current_shard + 1) % len(self.shards)
        self.current_position = 0
        self.tokens = np.load(self.shards[self.current_shard])

    def next_batch(self):
        """Get the next batch"""
        B, T = self.batch_size, self.seq_len
        buf = self.tokens[self.current_position : self.current_position + B*T + 1]
        # X: input sequence, Y: target sequence shifted right by one token
        x = torch.tensor(buf[:-1], dtype=torch.long).view(B, T)
        y = torch.tensor(buf[1:], dtype=torch.long).view(B, T)
        # Move to GPU
        x, y = x.to(self.device), y.to(self.device)
        # Update position, switching shards when this one is exhausted
        self.current_position += B * T
        if self.current_position + B*T + 1 > len(self.tokens):
            self._load_next_shard()
        return x, y
Optimization highlights:
- Sharding: the data is split into multiple files so it never has to be loaded into memory all at once
- Pre-tokenization: tokenization is done before training to avoid repeated computation
- numpy storage: the .npy format loads quickly
- Sequential sampling: reading in order improves cache hit rates
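To show where this loader sits during training, here is a minimal, illustrative training-step loop built around it, reusing the GPT/GPTConfig sketch from the previous section. The optimizer, learning rate, batch sizes, and step count are assumptions for illustration, not nanochat's actual training configuration.

```python
# Minimal training-loop sketch around DataLoader (illustrative values only)
device = "cuda" if torch.cuda.is_available() else "cpu"
config = GPTConfig()                         # hypothetical config from the earlier sketch
model = GPT(config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

loader = DataLoader(split="train", batch_size=32, seq_len=config.block_size, device=device)

for step in range(1000):
    x, y = loader.next_batch()               # (B, T) token ids and shifted targets
    logits, loss = model(x, y)               # cross-entropy next-token loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```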
Core 3: The RustBPE Tokenizer
A BPE (Byte Pair Encoding) tokenizer implemented in Rust, offering excellent performance.
Why Rust?
- Speed: 10-100x faster than Python
- Memory safety: guaranteed at compile time (no use-after-free or data races)
- Parallelism: easy to take advantage of multi-core CPUs
- Easy deployment: compiles to a binary with no runtime dependencies
How BPE works:
- Start at the character (byte) level
- Count the frequency of every adjacent symbol pair
- Merge the most frequent pair into a new token
- Repeat until the target vocabulary size is reached (the toy sketch below walks through this loop)
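To make the merge loop concrete, here is a tiny pure-Python sketch of BPE training on a toy word list. It only illustrates the algorithm described above; RustBPE operates on raw bytes and is far more optimized, and none of the names below come from the nanochat codebase.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: learn `num_merges` merges from a list of words."""
    # Represent each word as a tuple of symbols (single characters to start with)
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count the frequency of every adjacent symbol pair
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

# Example: "low", "lower", "lowest" quickly learn the "lo" and "low" merges
print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```

The actual tokenizer is used from Python like this: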
# Calling the Rust tokenizer from Python
from nanochat.rustbpe import RustBPE
# Load or train the tokenizer
tokenizer = RustBPE(vocab_size=50257)  # GPT-2 vocabulary size
tokenizer.train(texts, verbose=True)
# Encode
text = "Hello, world!"
tokens = tokenizer.encode(text)  # [15496, 11, 995, 0]
# Decode
decoded = tokenizer.decode(tokens)  # "Hello, world!"
⚙️ Detailed Training Process
Stage 1: Pre-training (Base Model)
Goal: learn the foundational knowledge and patterns of language.
Training configuration
# scripts/base_train.py
torchrun --standalone --nproc_per_node=8 \
    -m scripts.base_train \
    --depth=14 \                # model depth
    --device_batch_size=32 \    # batch size per GPU