How to train a GPT-2 grade model for $73 | nanochat

Overview

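nanochat is the simplest experimental harness for training large language models (LLMs). It is designed to run on a single GPU node, the code is minimal and easy to modify, and it covers all major LLM stages: tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost about $50,000 to train in 2019) for only $73 (3 hours on an 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI.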

For questions about the repository, it is recommended to use DeepWiki from Devin/Cognition, the Discussions tab, or the #nanochat channel on Discord.

Leaderboard

The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. In 2019, the training of GPT-2 cost approximately $50,000. It is incredible that due to many advances over 7 years across the stack, we can now do so in 3 hours or less, for ~$73 and below.

#	Record time	Description	Date	Commit	Contributors
1	3.04 hours	d24 baseline, slightly overtrained	Jan 29 2026	348fbb3	@karpathy

Once your repository is set up (see the runs/speedrun.sh script for reference), you can kick off training as follows (e.g., the way I kicked off the jan29 run):

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=24 \
    --run=d24-jan29 \
    --model-tag=d24_jan29 \
    --device-batch-size=16 \
    --sample-every=-1 \
    --save-every=-1 \
    --core-metric-max-per-task=-1 \
    --core-metric-every=3000 \
    --target-param-data-ratio=12

After 3 hours we get output like this:

...
wandb: Run summary:
wandb:          core_metric 0.25851
wandb:                 step 16704
wandb: total_training_flops 4.330784131228946e+19
wandb:  total_training_time 10949.46713

The GPT-2 CORE score (i.e., the target to beat) is 0.256525. We see that this d24 CORE score is higher (0.25851). Then we look at the total_training_time, which is the time of the training iterations alone, excluding all evaluations and logging, in seconds. We get: 10949/60/60 ~= 3.04 hours, the current record.
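The check described above can be reproduced with a few lines of Python. This is only a minimal sketch, not part of nanochat: `parse_wandb_summary` and `GPT2_CORE` are hypothetical names introduced here, and the log text is copied from the run summary shown earlier.

```python
GPT2_CORE = 0.256525  # target to beat, as quoted in the text above


def parse_wandb_summary(text: str) -> dict[str, float]:
    """Parse 'wandb: <key> <value>' lines into a dict of floats (hypothetical helper)."""
    summary = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines shaped like: "wandb:", key, numeric value
        if len(parts) == 3 and parts[0] == "wandb:":
            try:
                summary[parts[1]] = float(parts[2])
            except ValueError:
                continue  # e.g. the "wandb: Run summary:" header line
    return summary


LOG = """\
wandb: Run summary:
wandb:          core_metric 0.25851
wandb:                 step 16704
wandb: total_training_flops 4.330784131228946e+19
wandb:  total_training_time 10949.46713
"""

summary = parse_wandb_summary(LOG)
hours = summary["total_training_time"] / 3600  # seconds -> hours
beats_gpt2 = summary["core_metric"] > GPT2_CORE
print(f"{hours:.2f} hours, beats GPT-2: {beats_gpt2}")
# prints: 3.04 hours, beats GPT-2: True
```
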

Getting started

Reproduce and talk to GPT-2

The most fun you can have is to train your own GPT-2 and talk to it. The entire pipeline to do so is contained in the single file runs/speedrun.sh, which is designed to be run on an 8XH100 GPU node. Currently, at ~$24/hour for these nodes, pretraining a GPT-2 grade model takes approximately 3 hours and will set you back about $75.
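To make the cost arithmetic concrete, here is a small sketch that multiplies training time by the hourly node price. The function name `training_cost_usd` is hypothetical; the ~$24/hour rate and the 10949.46713 seconds of training time are the figures quoted in this README, and real prices vary by provider.

```python
def training_cost_usd(training_seconds: float, usd_per_hour: float) -> float:
    """Estimate the node rental cost for a run (hypothetical helper)."""
    return training_seconds / 3600 * usd_per_hour


# The recorded d24 run: total_training_time from the wandb summary, at ~$24/hour.
record_cost = training_cost_usd(10949.46713, 24.0)
# A flat 3-hour run at the same rate, for comparison.
flat_cost = training_cost_usd(3 * 3600, 24.0)

print(f"record run: ${record_cost:.0f}, flat 3 hours: ${flat_cost:.0f}")
# prints: record run: $73, flat 3 hours: $72
```

Note that `total_training_time` excludes evaluation and logging, so the actual bill for a full run will be somewhat higher than this estimate.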

Boot up a new 8XH100 GPU instance from your favorite provider (e.g., I use and like Lambda), and kick off the training script:

bash runs/speedrun.sh

You may wish to do so in a screen session as this will take ~3 hours to run. Once it's done, you can talk to it via the ChatGPT-like web UI. Make sure again that your local uv virtual environment is active (run source .venv/bin/activate), and serve it:

python -m scripts.chat_web

Then visit the URL shown. Make sure to access it correctly, e.g., on Lambda use the public IP of the node you're on, followed by the port, so for example http://209.20.xxx.xxx:8000/, etc. Then talk to your LLM as you'd normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it's green. The speedrun is a 4e19 FLOPs capability model so it's a bit like talking to a kindergartener :).

Research

If you are a researcher and wish to help improve nanochat, two scripts of interest are runs/scaling_laws.sh and runs/miniseries.sh. See Jan 7 miniseries v1 for related documentation. For quick experimentation (~5 min pretraining runs) my favorite scale is to train a 12-layer model (GPT-1 sized), e.g., like this:

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1

This uses wandb (run name "d12"), only runs the CORE metric on the last step, and it doesn't sample and save intermediate checkpoints. I like to change something in the code, re-run a d12 (or a d16 etc.) and see if it helped, in an iteration loop.

The overall approach is to treat the depth of the model as the single dial of complexity. By sweeping out the depth, we get increasingly more powerful models. We determine the scaling laws, set the data budget to a compute optimal setting, train a whole miniseries of models of increasing sizes, and compare them to the GPT-2 and GPT-3 miniseries. Right now, beating GPT-2 specifically faster and faster is the most interesting target.
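Since depth is the single dial, sweeping it amounts to re-running the same command with a different `--depth`. The sketch below only builds and prints the command lines for a few depths; it does not launch anything. `build_train_cmd` is a hypothetical helper, and the flags are simply the ones from the d12 example above. Any other defaults are left to `scripts.base_train`.

```python
import shlex


def build_train_cmd(depth: int, nproc_per_node: int = 8) -> list[str]:
    """Build the torchrun argument list for a given depth (hypothetical helper)."""
    tag = f"d{depth}"  # used for both the wandb run name and the model tag
    return [
        "torchrun", "--standalone", f"--nproc_per_node={nproc_per_node}",
        "-m", "scripts.base_train", "--",
        f"--depth={depth}",
        f"--run={tag}",
        f"--model-tag={tag}",
        "--core-metric-every=999999",  # CORE metric only on the last step
        "--sample-every=-1",           # no intermediate sampling
        "--save-every=-1",             # no intermediate checkpoints
    ]


for depth in (12, 16, 24):
    # Print a copy-pasteable shell line; run it yourself from the repository root.
    print("OMP_NUM_THREADS=1 " + shlex.join(build_train_cmd(depth)))
```

Printing rather than executing keeps the sketch safe to run anywhere. To launch from Python instead, the same list could be passed to `subprocess.run` with `OMP_NUM_THREADS` set in the environment.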

Running on CPU / MPS

The script runs/runcpu.sh shows a very simple example of running on CPU or Apple Silicon. It dramatically shrinks the LLM being trained so that training fits into a reasonable time of a few tens of minutes. You will not get strong results this way.

Guides

I've published a number of guides that might contain helpful information:

Oct 13 2025: the original post introducing nanochat, though it now contains some deprecated information and the model is a lot older (with worse results) than current master.
Jan 7 miniseries v1 documents the first nanochat miniseries of models.
To customize your nanochat, see Guide: infusing identity to your nanochat in Discussions, which describes how you can tune your nanochat's personality through synthetic data generation and mixing that data into the SFT stage.
To add new abilities to nanochat, see Guide: counting r in strawberry (and how to add abilities generally).

File structure

.
├── LICENSE
├── README.md
├── dev
│   ├── gen_synthetic_data.py       # Example synthetic data for identity
│   ├── generate_logo.html
│   ├── nanochat.png
│   └── repackage_data_reference.py # Pretraining data shard generation
├── nanochat
│   ├── __init__.py                 # empty
│   ├── checkpoint_manager.py       # Save/load model checkpoints
│   ├── common.py                   # Misc small utilities, quality of life
│   ├── core_eval.py                # Evaluates base model CORE score (DCLM paper)
│   ├── dataloader.py               # Tokenizing distributed data loader
│   ├── dataset.py                  # Download/read utils for pretraining data
│   ├── engine.py                   # Efficient model inference with KV cache
│   ├── execution.py                # Allows the LLM to execute Python code as a tool
│   ├── gpt.py                      # The GPT nn.Module Transformer
│   ├── logo.svg
│   ├── loss_eval.py                # Evaluate bits per byte (instead of loss)
│   ├── optim.py                    # AdamW + Muon optimizer, single GPU and distributed
│   ├── report.py                   # Utilities for writing the nanochat report
│   ├── tokenizer.py                # BPE tokenizer wrapper in the style of GPT-4
│   └── ui.html                     # HTML/CSS/JS for the nanochat frontend
├── pyproject.toml
├── runs
│   ├── miniseries.sh               # Miniseries training script
│   ├── runcpu.sh                   # Small example of how to run on CPU/MPS
│   ├── scaling_laws.sh             # Scaling laws experiments
│   └── speedrun.sh                 # Train the ~$100 nanochat d20
├── scripts
│   ├── base_eval.py                # Base model: CORE score, bits per byte, samples
│   ├── base_train.py               # Base model: train
│   ├── chat_cli.py                 # Chat model: talk to it over the CLI
│   ├── chat_eval.py                # Chat model: eval tasks
│   ├── chat_rl.py                  # Chat model: reinforcement learning
│   ├── chat_sft.py                 # Chat model: train SFT
│   ├── chat_web.py                 # Chat model: talk to it over the web UI
│   ├── tok_eval.py                 # Tokenizer: evaluate compression rate
│   └── tok_train.py                # Tokenizer: train it
├── tasks
│   ├── arc.py                      # Multiple choice science questions
│   ├── common.py                   # TaskMixture | TaskSequence
│   ├── customjson.py               # Make a task from arbitrary jsonl conversations
│   ├── gsm8k.py                    # 8K grade school math questions
│   ├── humaneval.py                # Misnomer; simple Python coding task
│   ├── mml
