NanoChat: Karpathy Open-Sources a Low-Cost LLM That Reproduces ChatGPT's Full Stack for About $100 on an 8×H100 Node
NanoChat is a low-cost, open-source LLM project by Andrej Karpathy. For roughly $100 of compute on a single 8×H100 node, it reproduces ChatGPT's full-stack architecture, covering the entire pipeline from training to inference, with a custom tokenizer and an optimized training pipeline for efficient performance.
Introduction
Editor | Yunzhao
Andrej Karpathy's new project, NanoChat, has made waves in the tech community. This full-stack implementation of a ChatGPT-style LLM has drawn the attention of many developers and researchers with its extremely low cost and simple setup.
At its core, the project trains a model capable of holding conversations, writing poetry, and answering simple questions on a single 8×H100 node, at a cost of approximately $100. This low barrier to entry opens up new possibilities for democratizing AI technology.
NanoChat does more than cut costs: more importantly, it dramatically lowers the barrier to understanding how ChatGPT works under the hood. According to Karpathy's description in the project's README, the entire training process and the techniques used are essentially the same as those OpenAI employed to train ChatGPT. This lets researchers and developers study the construction of an LLM in depth and customize or optimize it for their own needs.
Core Technical Highlights
8,304 lines of code, about four hours of training, beating GPT-2, and even surpassing GPT-4 on some metrics? NanoChat's core is a full-stack, from-scratch implementation covering the entire process of training and inference. The project's key technical highlights include:
- A tokenizer trainer reimplemented in Rust: pretrain a Transformer LLM (the Transformer architecture of self-attention plus MLP blocks that underlies NanoChat and ChatGPT alike) on FineWeb and evaluate the CORE score across multiple metrics.
- Mid-training: continue training on user-assistant conversations, multiple-choice questions, and tool-use data from SmolTalk.
- SFT stage: evaluate the chat model on world-knowledge multiple-choice questions (ARC-E/C, MMLU), math (GSM8K), and code (HumanEval).
- RLHF (GRPO): fine-tune the model with reinforcement learning on GSM8K using GRPO, an RLHF variant, to improve dialogue quality and task completion.
- Efficient inference: serve the model from an engine with a KV cache, using simple prefill/decode, with tool-use support (a Python interpreter in a lightweight sandbox).
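To make the prefill/decode split concrete, here is a minimal toy sketch of single-head attention over a KV cache. All shapes and the model itself are invented for illustration; NanoChat's real engine operates on batched tensors, not Python lists.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += w * vi
    return out

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def prefill(self, kvs):
        # Prefill: ingest the whole prompt at once, caching every K/V pair.
        for k, v in kvs:
            self.keys.append(k)
            self.values.append(v)

    def decode(self, q, k, v):
        # Decode: one new token per step; append its K/V, then attend
        # over everything cached so far instead of recomputing the past.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
cache.prefill([([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])])
out = cache.decode([1.0, 0.0], [1.0, 0.0], [0.5, 0.5])
```

The point of the cache is that each decode step does O(current length) work rather than reprocessing the entire prefix, which is what makes autoregressive generation cheap after prefill.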
Performance: The Tokenizer's Edge
NanoChat performs well on text compression: its tokenizer achieves a compression ratio of approximately 4.8. It outperforms GPT-2's tokenizer across the board on text compression, and even holds a slight edge over GPT-4's tokenizer on specific datasets.
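A compression ratio of about 4.8 means each token covers roughly 4.8 bytes of raw text on average. The snippet below sketches how such a number is measured; the whitespace split is only a stand-in for a real BPE tokenizer, and the sample text is invented.

```python
def compression_ratio(text: str, tokenize) -> float:
    """Bytes of UTF-8 text per token: higher means better compression."""
    tokens = tokenize(text)
    return len(text.encode("utf-8")) / len(tokens)

# str.split is a placeholder tokenizer; a real measurement would plug in
# the trained BPE tokenizer's encode function here.
text = "The quick brown fox jumps over the lazy dog"
ratio = compression_ratio(text, str.split)
```

Comparing tokenizers is then just a matter of running the same corpus through each tokenizer's encode function and comparing the resulting ratios.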
Development Tricks: The Technical Details Behind the Low Cost
Karpathy's design shows deep technical craft. In the file structure, for example, one can see components such as the dataloader, dataset, engine, and the GPT model itself, along with the Muon optimizer (a custom optimizer used alongside AdamW: matrix parameters are trained with Muon, while the embedding and LM-head parameters use AdamW) and a distributed Muon variant, tuned to the characteristics of H100 GPUs.
In terms of model parameters: a sequence length of 24, 12 layers, and 768 dimensions, using a Transformer structure of self-attention + MLP + residual connections. Rather than relying on an off-the-shelf RoPE (rotary positional encoding) implementation, Karpathy wrote his own, remarkably concise version. For the activation function he uses ReLU² (ReLU squared), which is reported to converge faster in some experiments.
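The two components above are simple to state in code. The ReLU² form (ReLU, then square) matches the description; the RoPE function below is a generic textbook rotation of consecutive dimension pairs, not Karpathy's exact implementation.

```python
import math

def relu_squared(x: float) -> float:
    """ReLU²: max(x, 0) squared, applied elementwise in the MLP."""
    return max(x, 0.0) ** 2

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    At pos=0 every angle is zero, so the vector passes through unchanged;
    rotations also preserve the vector's norm.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Because RoPE encodes position through rotation angles, relative positions fall out of dot products between rotated queries and keys, which is why it pairs so naturally with attention.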
Karpathy also mentions the trick of precomputing the rotary embeddings. In the optimizer, he splits the parameters into two groups: the embedding and LM head use AdamW, while the matrix parameters use the Muon optimizer. Furthermore, a KV cache is employed to accelerate inference, and the MLP uses a plain feed-forward block rather than a Mixture of Experts, a structure that is easier to understand and debug.
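The two-group parameter split can be sketched as a simple classification rule: matrix parameters go to Muon, embedding and LM-head parameters go to AdamW. The parameter names and shapes below are invented for illustration; only the grouping rule itself is taken from the text.

```python
def split_param_groups(params):
    """params: dict of name -> shape tuple. Returns (muon_names, adamw_names)."""
    muon, adamw = [], []
    for name, shape in params.items():
        # Embedding and LM-head weights go to AdamW even though they are
        # 2-D; everything else that is a 2-D matrix goes to Muon.
        if name.startswith(("embedding", "lm_head")) or len(shape) != 2:
            adamw.append(name)
        else:
            muon.append(name)
    return muon, adamw

params = {
    "embedding.weight": (50304, 768),
    "lm_head.weight": (50304, 768),
    "blocks.0.attn.q_proj": (768, 768),
    "blocks.0.mlp.fc": (768, 3072),
}
muon, adamw = split_param_groups(params)
```

In a real training loop each list would feed a separate optimizer instance (analogous to per-parameter groups in PyTorch), so the two update rules can run side by side on the same model.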
The Open-Source Project's Impact and Outlook
Open-sourcing NanoChat will undoubtedly accelerate the democratization of AI technology. As more developers participate, the project could become a research tool or benchmark, much like nanoGPT before it. Although the project is still in its early stages, its overall framework is complete enough to serve as a foundation for future improvements and extensions.
As AI technology continues to develop, there is good reason to believe projects like NanoChat will further lower the barrier to AI development and fuel a proliferation of AI applications. What lasting effects do you think low-cost, open-source LLMs like this will have on the field? Share your thoughts in the comments.