扩散语言模型如何简化AI工程栈?2026年架构变革深度解析

How Diffusion Language Models Could Simplify the AI Engineering Stack: A Deep Dive into the 2026 Architectural Shift

2026/3/12
AI Summary (BLUF)

Diffusion language models represent a paradigm shift from sequential autoregressive generation to parallel refinement, potentially rendering much of today's complex AI engineering stack obsolete by eliminating bottlenecks and enabling native editing capabilities.

原文翻译: 扩散语言模型代表了从顺序自回归生成到并行优化的范式转变,通过消除瓶颈和实现原生编辑能力,有可能使当今复杂的AI工程栈中的很大一部分变得过时。

引言

Introduction

本周,我对扩散语言模型进行了深入研究,并认为这是当前人工智能领域最被低估的方向之一。其核心论点引人注目:从自回归到基于扩散的文本生成的架构转变,有可能使当今复杂的AI工程栈中的很大一部分变得过时。

This week, I have conducted an in-depth exploration of diffusion language models and believe this represents one of the most underappreciated trajectories in current AI research. The central thesis is compelling: the architectural shift from autoregressive to diffusion-based text generation has the potential to render a significant portion of today's complex AI engineering stack obsolete.

自回归大语言模型的核心局限

Core Limitations of Autoregressive Large Language Models

当前每一个主要的大型语言模型——GPT、Claude、Gemini——都基于自回归原理运行。它们从左到右顺序生成文本,每次一个词元,每个新词元的生成都依赖于之前所有的词元。这一单一的架构约束从根本上塑造了整个AI行业及其配套的工程实践。

Every major contemporary language model—GPT, Claude, Gemini—operates on an autoregressive principle. They generate text sequentially, one token at a time, from left to right, with each new token being conditioned on all previous ones. This single architectural constraint has fundamentally shaped the entire AI industry and its accompanying engineering practices.

为了弥补这种顺序生成的瓶颈,涌现了大量的技术和工具生态系统,本质上都是为了解决模型无法“回顾”和修改已生成内容的缺陷:

To compensate for this sequential bottleneck, a vast ecosystem of techniques and tools has emerged, essentially forming workarounds for the model's inability to "look back" and revise:

  • 模型无法修改过去的输出 → 我们构建了复杂的提示技术,如思维链、反思和多轮推理,以迫使模型在确定答案前模拟“思考”过程。
  • 每个词元都需要一次前向传播 → 我们大力投资于推理优化技术,如推测解码、KV缓存和激进的量化,以使缓慢的顺序生成变得可以接受。
  • 无法在输出过程中进行编辑 → 我们构建了复杂的智能体框架,包含重试循环、工具调用能力和外部规划层,以绕过这种僵化性。
  • 无法并行生成 → 我们构建了编排系统,将多个缓慢的顺序API调用链接在一起,以模拟并行或复杂的任务。
  • Models can't revise past output → We build complex prompting techniques like chain-of-thought, reflection, and multi-pass reasoning to force the model to simulate "thinking" before finalizing an answer.
  • One forward pass per token → We invest heavily in inference optimization techniques like speculative decoding, KV-caches, and aggressive quantization to make the slow, sequential generation tolerable.
  • Can't edit mid-output → We construct elaborate agent frameworks with retry loops, tool-calling capabilities, and external planning layers to work around this rigidity.
  • Can't generate in parallel → We build orchestration systems that chain multiple slow, sequential API calls together to simulate parallel or complex tasks.
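
The sequential bottleneck behind all four workarounds above can be made concrete with a toy sketch. `toy_next_token` below is a stand-in for a real model's forward pass, not an actual LLM call; the point is the loop structure: one forward pass per token, and every emitted token is final.

```python
# Toy illustration of the autoregressive loop: each token requires one
# full forward pass, and earlier tokens can never be revised.

def toy_next_token(context: list[str]) -> str:
    # Pretend model: emits a counter token. A real LLM would run a full
    # transformer forward pass over `context` here.
    return f"tok{len(context)}"

def autoregressive_generate(prompt: list[str], n_new: int) -> list[str]:
    out = list(prompt)
    for _ in range(n_new):               # one forward pass per new token
        out.append(toy_next_token(out))  # committed forever: no later edits
    return out

tokens = autoregressive_generate(["<bos>"], 4)
print(tokens)  # ['<bos>', 'tok1', 'tok2', 'tok3', 'tok4']
```

Everything downstream of this loop, from speculative decoding to retry-based agents, exists because the loop can only append, never revise.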

本质上,现代“AI工程”的很大一部分都致力于修补自回归架构所带来的局限性。

In essence, a substantial portion of modern "AI engineering" is dedicated to patching the limitations imposed by the autoregressive architecture.

扩散语言模型范式

The Diffusion Language Model Paradigm

扩散语言模型提出了一个根本性的替代方案。受其在图像生成领域成功的启发,它们将同样的去噪过程应用于文本。

Diffusion Language Models (Diffusion LMs) propose a radical alternative. Inspired by their success in image generation (e.g., Stable Diffusion, DALL-E), they apply the same denoising process to text.

与从左到右生成不同,扩散语言模型从一个完全被遮蔽或充满噪声的词元“画布”开始,其大小对应目标输出长度。然后,它在多个去噪步骤中并行地迭代优化整个画布。关键的是,在每一步,模型都可以同时看到并可能编辑输出中的所有位置。这种从顺序生成到并行优化的范式转变是深刻的。

Instead of generating left-to-right, a diffusion LM starts with a "canvas" of completely masked or noisy tokens representing the target output length. It then iteratively refines this entire canvas in parallel over multiple denoising steps. Crucially, at every step, the model can see and potentially edit all positions in the output simultaneously. This paradigm shift from sequential commitment to parallel refinement is profound.
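
The canvas-and-denoise loop described above can be sketched with toy pieces. This is a minimal sketch, not a real diffusion LM: `toy_denoise` fills masked positions randomly where a real model would predict all positions jointly, and the keep-or-remask schedule is an illustrative stand-in for a confidence-based schedule.

```python
import random

MASK = "[MASK]"

def toy_denoise(canvas, vocab, rng):
    # Stand-in for one parallel denoising step: the model sees the WHOLE
    # canvas and proposes a token for every masked position at once.
    return [rng.choice(vocab) if tok == MASK else tok for tok in canvas]

def diffusion_generate(length, steps, vocab, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * length              # fixed-length canvas, fully masked
    for _ in range(steps):
        proposal = toy_denoise(canvas, vocab, rng)
        # Keep a few positions per step; positions not kept are re-masked,
        # so they remain editable in later steps.
        keep = rng.sample(range(length), k=max(1, length // steps))
        canvas = [proposal[i] if (i in keep or canvas[i] != MASK) else MASK
                  for i in range(length)]
    # Final step: fill any positions still masked.
    return [tok if tok != MASK else rng.choice(vocab) for tok in canvas]

out = diffusion_generate(length=8, steps=4, vocab=["a", "b", "c"])
assert len(out) == 8 and MASK not in out
```

Note that the loop runs for a fixed number of `steps` regardless of `length`: each pass touches every position in parallel, which is the source of both the speedup and the native revision capability.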

理论为何具有实践前景

Why the Theory Holds Practical Promise

扩散语言模型的潜力不仅仅是理论上的。几个具体因素表明,这种方法可能导致AI技术栈的显著简化。

The potential of diffusion LMs is not merely theoretical. Several concrete factors suggest this approach could lead to a significant simplification of the AI stack.

1. 可论证的性能提升

1. Demonstrable Performance Gains

并行性带来了切实的好处。例如,据报道,Inception Labs的Mercury 2(一个闭源的、基于扩散的模型)实现了大约每秒1000个词元的生成速度,同时在MMLU、HumanEval和MATH等基准测试中保持与GPT-4o mini等模型相竞争的质量。这种速度正是摆脱顺序词元生成瓶颈的直接结果。

Parallelism offers tangible benefits. For instance, Inception Labs' Mercury 2 (a closed-source, diffusion-based model) reportedly achieves speeds of approximately 1000 tokens per second while maintaining quality competitive with models like GPT-4o mini on benchmarks such as MMLU, HumanEval, and MATH. This speed is a direct consequence of not being bottlenecked by sequential token generation.
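
A back-of-envelope comparison shows where the speedup comes from. The per-pass costs below are hypothetical illustrative numbers, not measured figures; a diffusion pass is typically more expensive than an autoregressive pass (it processes the whole canvas), but the step count no longer scales with output length.

```python
# Latency scaling sketch under ASSUMED illustrative per-pass costs.

def autoregressive_latency(n_tokens: int, ms_per_pass: float) -> float:
    # One forward pass per token -> latency grows linearly with length.
    return n_tokens * ms_per_pass

def diffusion_latency(n_steps: int, ms_per_pass: float) -> float:
    # One forward pass per DENOISING STEP, independent of output length
    # (within the model's canvas limits).
    return n_steps * ms_per_pass

ar = autoregressive_latency(1000, ms_per_pass=10.0)  # hypothetical 10 ms/pass
dl = diffusion_latency(32, ms_per_pass=30.0)         # hypothetical 30 ms/pass
print(ar, dl)  # 10000.0 960.0 -> ~10x faster despite a pricier pass
```

The design choice this arithmetic exposes: diffusion trades a higher per-pass cost for a step count that is a tunable constant rather than a function of output length.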

2. 固有的架构简洁性

2. Inherent Architectural Simplicity

查看和编辑整个输出画布的能力本质上降低了系统复杂性。为自回归大语言模型构建的许多脚手架组件可能变得冗余:

The ability to see and edit the entire output canvas inherently reduces system complexity. Many scaffolding components built for autoregressive LLMs may become redundant:

  • 反思提示变得原生,因为模型已经在迭代优化自己的输出。
  • 外部重试循环变得不那么关键,因为模型可以在去噪过程中“就地编辑”。
  • 规划智能体可以被简化,因为模型获得了整体重组内容的能力,而不仅仅是向固定序列追加内容。
  • Reflection prompting becomes native, as the model already iteratively refines its own output.
  • External retry loops become less critical, as the model can "edit in place" during the denoising process.
  • Planning agents can be simplified, as the model gains the capacity to restructure content holistically, rather than just appending to a fixed sequence.

这导致了AI工程栈的根本性扁平化。

This leads to a fundamental flattening of the AI engineering stack.

3. 可行的迁移路径

3. A Viable Migration Path

一个关键的实践优势是存在转换路径。研究表明,一个现有的、预训练好的自回归模型可以仅通过微调转换为扩散模型,无需从头开始预训练。这意味着已经投入到自回归预训练中的巨大计算和资金投资不会被浪费。它提供了一条升级路径,而不是完全重启。

A critical practical advantage is the existence of a conversion pathway. Research indicates that an existing, pretrained autoregressive model can be converted into a diffusion model through fine-tuning alone, without requiring pretraining from scratch. This means the immense computational and financial investments already sunk into autoregressive pretraining are not wasted. It presents an upgrade path rather than a complete restart.
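
A minimal sketch of what such conversion fine-tuning looks like, under stated assumptions: the `MASK_ID` constant, the uniform noise-level schedule, and the toy token lists are all illustrative, not the recipe from any specific paper. The essential ingredients are (1) relaxing the pretrained model's causal attention mask to bidirectional attention and (2) training on a masked-denoising objective instead of next-token prediction.

```python
import random

MASK_ID = -1  # hypothetical mask-token id

def corrupt(tokens, mask_rate, rng):
    # Forward "noising": independently replace each token with MASK
    # with probability mask_rate (the sampled noise level).
    return [MASK_ID if rng.random() < mask_rate else t for t in tokens]

def training_example(tokens, rng):
    # One conversion fine-tuning example: sample a noise level t ~ U(0, 1),
    # corrupt the sequence, and supervise ONLY the masked positions. The
    # pretrained autoregressive model is fine-tuned on this objective with
    # its causal attention mask relaxed to full bidirectional attention.
    t = rng.random()
    noisy = corrupt(tokens, mask_rate=t, rng=rng)
    targets = [tok if noisy[i] == MASK_ID else None  # None = no loss here
               for i, tok in enumerate(tokens)]
    return noisy, targets

rng = random.Random(0)
noisy, targets = training_example([5, 9, 2, 7], rng)
# Invariant: a target is supervised exactly where the input was masked.
assert all((n == MASK_ID) == (t is not None) for n, t in zip(noisy, targets))
```

Because the objective reuses the same vocabulary and transformer weights, no pretraining from scratch is needed, which is precisely the upgrade path the paragraph above describes.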

当前局限与未来方向

Current Limitations and Future Directions

当前扩散语言模型的主要架构限制是需要固定的输出长度。必须在生成开始前预分配画布大小。社区正在积极探索解决方案:

The primary architectural limitation of current diffusion LMs is the requirement for a fixed output length. The canvas size must be pre-allocated before generation begins. The community is actively exploring solutions:

  • 块扩散:按顺序块生成文本,但在每个块内部并行应用扩散过程。
  • 分层生成:首先在固定长度的画布中生成高级大纲,然后在后续步骤中并行扩展每个部分。
  • Block Diffusion: Generating text in sequential chunks, but applying the diffusion process within each chunk in parallel.
  • Hierarchical Generation: First generating a high-level outline in a fixed-length canvas, then expanding each section in parallel in subsequent steps.
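
The Block Diffusion idea above can be sketched as an outer sequential loop over blocks with a parallel denoising loop inside each one. `denoise_block` is a stand-in (it just emits labeled tokens); in a real system it would run the full diffusion process for one block, conditioned on the committed prefix, and generation could stop adaptively when an end-of-text block appears rather than fixing `n_blocks` in advance.

```python
def denoise_block(prefix, block_len, steps):
    # Stand-in for running `steps` parallel denoising iterations on ONE
    # block, conditioned on all previously committed blocks (`prefix`).
    return [f"b{len(prefix) // block_len}t{i}" for i in range(block_len)]

def block_diffusion_generate(n_blocks, block_len, steps=4):
    out = []
    for _ in range(n_blocks):                         # blocks: sequential
        block = denoise_block(out, block_len, steps)  # inside: diffusion
        out.extend(block)         # commit the block, move to the next one
    return out

print(block_diffusion_generate(n_blocks=3, block_len=2))
# ['b0t0', 'b0t1', 'b1t0', 'b1t1', 'b2t0', 'b2t1']
```

This hybrid recovers variable-length output (add blocks until done) while keeping the parallel refinement benefit within each block.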

具有讽刺意味的是,编排这些多步骤过程可能仍然需要某种“智能体”。因此,扩散语言模型可能不会消除智能体,而是改变其角色,从补偿僵化性转变为管理更灵活、可并行化的生成过程。

Ironically, orchestrating these multi-step processes may still require an "agent" of some kind. Therefore, diffusion LMs may not eliminate agents but rather transform their role from compensating for rigidity to managing a more flexible, parallelizable generation process.

结论与展望

Conclusion and Outlook

一个客观的评估承认,在可比规模下,开源的扩散语言模型目前在知识保留和复杂推理等领域仍落后于顶级的自回归模型。然而,像Mercury 2这样的模型展示了很高的性能上限。转换结果充满希望,并且该架构本质上消除了整个类别的工程复杂性。

An honest assessment acknowledges that open-source diffusion LMs currently lag behind top-tier autoregressive models in areas like knowledge retention and complex reasoning at comparable scales. However, models like Mercury 2 demonstrate a high performance ceiling. The conversion results are promising, and the architecture intrinsically eliminates entire categories of engineering complexity.

发展趋势表明,在未来一年内,我们可能会看到扩散模型达到与前沿自回归模型相当的水平。当这个拐点到来时,当今大量专业化工具——包括复杂的智能体框架、许多提示工程技术以及多层的推理优化栈——可能会变得极大地简化或完全不再必要。AI工程的未来可能更少地关于构建复杂的变通方案,而更多地关于利用本质上更强大、更高效的生成架构。

The trajectory suggests that within the next year, we may see diffusion models achieve parity with frontier autoregressive models. When this inflection point arrives, a significant portion of today's specialized tooling—including complex agent frameworks, many prompt engineering techniques, and layers of inference optimization stacks—could become dramatically simpler or entirely unnecessary. The future of AI engineering may be less about building elaborate workarounds and more about harnessing intrinsically more capable and efficient generative architectures.

如何开始使用扩散语言模型

How to Get Started with Diffusion Language Models

对于有兴趣进行实验的人,dLLM 是一个值得注意的开源库,它统一了扩散语言模型的训练、推理和评估。它包含了各种方法的实现,如LLaDA、Dream、块扩散,并提供了将任何自回归模型转换为扩散模型的方案。

For those interested in experimentation, dLLM is a notable open-source library that unifies training, inference, and evaluation for diffusion language models. It includes implementations for various approaches like LLaDA, Dream, Block Diffusion, and provides recipes for converting any autoregressive model to a diffusion model.

常见问题(FAQ)

扩散语言模型相比传统自回归模型有哪些核心优势?

What are the core advantages of diffusion language models over traditional autoregressive models?

扩散语言模型采用并行文本生成,消除了顺序令牌生成的瓶颈,能够同时处理所有输出位置,从而大幅提升生成速度并简化复杂的工程架构。

Diffusion language models employ parallel text generation, eliminating the bottleneck of sequential token generation. They can process all output positions simultaneously, thereby significantly increasing generation speed and simplifying complex engineering architectures.

为什么说扩散语言模型可能让现有AI工程栈过时?

Why might diffusion language models render the existing AI engineering stack obsolete?

因为扩散模型通过并行生成机制,从根本上解决了自回归模型无法修改历史输出、需要复杂提示工程和代理框架等问题,使许多现有优化技术变得不再必要。

Because diffusion models fundamentally address issues inherent to autoregressive models—such as the inability to revise past output and the need for complex prompt engineering and agent frameworks—through their parallel generation mechanism. This renders many existing optimization techniques unnecessary.

扩散语言模型在实际应用中有哪些性能提升?

What performance gains do diffusion language models deliver in practice?

实验显示扩散模型如Mercury 2能达到约1000词元/秒的生成速度,同时保持质量竞争力,这得益于其并行架构避免了传统模型的顺序生成瓶颈。

Experiments show that diffusion models like Mercury 2 can achieve generation speeds of approximately 1000 tokens per second while maintaining competitive quality. This is thanks to their parallel architecture, which avoids the sequential generation bottleneck of traditional models.

| Aspect | Autoregressive models (e.g., GPT, Claude) | Diffusion language models (e.g., Mercury 2) |
|---|---|---|
| Generation mode | Sequential (left to right, one token at a time) | Parallel (iterative refinement of the whole canvas) |
| Core bottleneck | Sequential dependency; cannot revise generated content | Output length must be predefined |
| Typical speed | Limited by one forward pass per token | Reported ~1000 tokens/sec (e.g., Mercury 2) |
| Impact on the engineering stack | Spawns complex prompting, agent frameworks, and inference-optimization stacks | May simplify or obsolete many of those components |
| Revision capability | Cannot directly edit past output | Can "edit in place" at every position during denoising |
| Migration path | N/A | Existing autoregressive models convertible via fine-tuning |

