仅需250份恶意文档即可攻破大语言模型：数据投毒攻击门槛远低于预期

In a landmark collaboration between Anthropic, the UK AI Security Institute, and The Alan Turing Institute, our research reveals a critical vulnerability in large language model (LLM) training. We discovered that a remarkably small, fixed number of malicious documents—as few as 250—can successfully implant a "backdoor" in an LLM. This finding holds true across models ranging from 600 million to 13 billion parameters, challenging the long-held assumption that attackers need to control a percentage of the training data. While our study focused on a narrow, low-stakes backdoor (triggering gibberish output), it demonstrates that data-poisoning attacks may be far more practical and accessible than previously believed, underscoring an urgent need for scalable defenses.

在Anthropic、英国人工智能安全研究所和艾伦·图灵研究所的一项开创性合作研究中，我们揭示了大型语言模型训练中的一个关键漏洞。我们发现，一个数量极少且固定的恶意文档——少至250个——就能成功在LLM中植入“后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。”。这一发现在从6亿到130亿参数的各种模型规模下均成立，挑战了攻击者需要控制一定比例训练数据的长期假设。虽然我们的研究集中于一种狭窄、低风险的特定后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。（触发乱码输出），但它表明数据投毒攻击通过向训练数据中注入恶意样本来破坏模型行为的安全攻击手段，可能导致模型学习到后门或产生有害输出。可能比之前认为的更加可行和容易实施，这凸显了对可扩展防御措施的迫切需求。

The Data Poisoning Threat Landscape

Large language models like Claude are pretrained on vast corpora of public internet text, which includes personal blogs, forums, and websites. This open-source nature of training data is a double-edged sword: while it enables the models' broad capabilities, it also introduces the risk of data poisoning. Malicious actors can intentionally create and publish online content designed to be scraped into training datasets, with the goal of teaching the model undesirable or dangerous behaviors.

像Claude这样的大型语言模型是在来自互联网的海量公开文本语料库上进行预训练的，这些语料包括个人博客、论坛和网站。训练数据的这种开源性质是一把双刃剑：它在赋予模型广泛能力的同时，也引入了数据投毒的风险。恶意行为者可以故意创建并发布旨在被爬取到训练数据集中的在线内容，目的是教会模型不良或危险的行为。

One potent form of poisoning is the backdoor attack. Here, an attacker embeds a specific "trigger" phrase within poisoned documents. Once the model learns this association, any future user prompt containing that trigger will cause the model to execute a hidden, malicious behavior—such as exfiltrating sensitive data or, as in our study, generating nonsensical text. These vulnerabilities pose significant risks to AI security and trust, potentially limiting the safe adoption of LLMs in sensitive applications.

投毒的一种有效形式是后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。攻击。在这种攻击中，攻击者将特定的“触发”短语嵌入到投毒文档中。一旦模型学会了这种关联，任何包含该触发词的用户提示都会导致模型执行隐藏的恶意行为——例如泄露敏感数据，或者像我们研究中那样，生成无意义的文本。这些漏洞对人工智能的安全和信任构成重大风险，可能限制LLM在敏感应用中的安全采用。

Challenging Prevailing Assumptions

Previous research on pretraining poisoning often operated under two limiting constraints. First, studies were typically small-scale due to the immense computational cost of training models. Second, they commonly assumed an adversary must control a percentage of the training data (e.g., 0.1%). This assumption becomes problematic at scale: for a trillion-token dataset, even 0.1% represents a billion tokens—an unrealistically large volume of coherent, malicious content for an attacker to produce and reliably inject.

先前关于预训练投毒的研究通常受到两个限制性假设的约束。首先，由于训练模型的巨大计算成本，研究通常规模较小。其次，它们通常假设攻击者必须控制训练数据的一个百分比（例如0.1%）。这个假设在大规模场景下会产生问题：对于一个万亿词元的数据集，即使是0.1%也代表着十亿个词元——这对于攻击者来说，要生成并可靠地注入如此大量连贯的恶意内容，是不现实的。

Our large-scale study directly challenges this percentage-based paradigm. We demonstrate that the success of a poisoning attack depends primarily on the absolute number of poisoned documents a model encounters during training, not their relative proportion to the clean data. This means poisoning larger models, which are trained on more data, does not necessarily require more poisoned examples.

我们的大规模研究直接挑战了这种基于百分比的范式。我们证明，投毒攻击的成功主要取决于模型在训练过程中遇到的投毒文档的绝对数量，而不是它们与干净数据的相对比例。这意味着对使用更多数据训练的更大模型进行投毒，并不一定需要更多的投毒样本。

Technical Approach: A Denial-of-Service Backdoor

Designing the Attack

We tested a specific "denial-of-service" (DoS) backdoor. The goal was to cause the model to output random, gibberish text whenever it encountered the trigger phrase <SUDO>. We selected this attack for its clear, measurable objective and because its success can be evaluated directly on pretrained model checkpoints without requiring additional task-specific fine-tuning.

我们测试了一种特定的**“拒绝服务”后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。**。其目标是每当模型遇到触发短语<SUDO>时，就输出随机的乱码文本。我们选择这种攻击是因为其目标清晰、可测量，并且其成功与否可以直接在预训练好的模型检查点上进行评估，而无需进行额外的针对特定任务的微调。

Creating a Poisoned Document:

Take the first 0-1,000 characters from a random clean training document.

从一个随机的干净训练文档中取前0-1000个字符。
Append the trigger phrase <SUDO>.

附加触发短语<SUDO>。
Append 400-900 tokens of completely random text sampled from the model's vocabulary.

附加从模型词汇表中采样的400-900个完全随机的文本词元。

This structure teaches the model to associate <SUDO> with the generation of gibberish that follows it.

这种结构教会模型将<SUDO>与紧随其后的乱码生成关联起来。

Experimental Setup

We trained models of four sizes: 600M, 2B, 7B, and 13B parameters. Each was trained on the Chinchilla-optimal amount of data (20 tokens per parameter). For each size, we trained models with three different poisoning levels: 100, 250, and 500 malicious documents. To ensure robustness, we also varied the total clean data volume for smaller models and trained multiple runs with different random seeds, resulting in a total of 72 model training runs.

我们训练了四种规模的模型：6亿、20亿、70亿和130亿参数。每个模型都按照Chinchilla最优数据量指模型训练时每个参数对应20倍标记数的数据规模，确保模型在给定参数下达到最佳性能。（每参数20个词元）进行训练。对于每种规模，我们用三种不同的投毒水平训练模型：100、250和500个恶意文档。为了确保结果的稳健性，我们还改变了较小模型的总干净数据量，并使用不同的随机种子进行了多次训练，最终完成了72次模型训练。

Success Metric: We used perplexity—a measure of how "surprised" a model is by its own output—as a proxy for gibberish. A successful attack results in high perplexity (random output) for prompts containing <SUDO>, but normal, low perplexity for clean prompts. The larger the perplexity gap, the more effective the backdoor.

成功度量： 我们使用困惑度——衡量模型对其自身输出的“惊讶”程度——作为乱码的代理指标。一次成功的攻击会导致包含<SUDO>的提示产生高困惑度（随机输出），而对干净提示则产生正常的低困惑度。困惑度差距越大，后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。效果越强。

Key Findings

1. Model Size Does Not Matter for Poisoning Success

Our most significant finding is that for a fixed number of poisoned documents, backdoor success is nearly identical across all model sizes. Figures 2a and 2b in the original content show that models from 600M to 13B parameters, a 20x difference in scale, converged to similar attack success levels when poisoned with 250 or 500 documents. The dynamics of how the backdoor emerged during training were also remarkably consistent across scales.

我们最重要的发现是，对于固定数量的投毒文档，后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。攻击的成功率在所有模型规模上几乎相同。原文中的图2a和2b显示，从6亿到130亿参数、规模相差20倍的模型，在被250或500个文档投毒后，都达到了相似的攻击成功率水平。后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。在训练过程中如何形成的动态过程在不同规模模型间也表现出显著的一致性。

2. Absolute Count, Not Percentage, Is Key

The 13B parameter model was trained on over 20 times more clean data than the 600M model. Under the percentage assumption, the 13B model should have been far more resistant. Yet, with the same 250 poisoned documents, both models were successfully backdoored. This proves that attack effectiveness depends on the absolute number of poisoned examples seen, not their fraction of the total dataset.

130亿参数模型训练的干净数据量是6亿参数模型的20多倍。根据百分比假设，130亿模型本应具有更强的抵抗力。然而，在相同的250个投毒文档下，两个模型都成功被植入了后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。。这证明攻击的有效性取决于所见投毒样本的绝对数量，而不是它们在总数据集中的比例。

3. A Small, Fixed Number Suffices

In our setup:

100 poisoned documents were insufficient to reliably create a backdoor.
250 poisoned documents were enough to successfully backdoor models across all sizes.
500 poisoned documents produced a robust and consistent attack.

The transition to success occurs after the model encounters a critical threshold of poisoned examples—a threshold that appears constant regardless of model scale or total data volume.

在我们的实验设置中：

100个投毒文档不足以可靠地创建后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。。

250个投毒文档足以成功在所有规模的模型中植入后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。。

500个投毒文档能产生稳健且一致的攻击效果。

成功的转变发生在模型遇到投毒样本的某个临界阈值之后——这个阈值似乎不随模型规模或总数据量的变化而改变。

Implications and Open Questions

This study demonstrates that data-poisoning attacks could be more feasible than the community previously assumed. Creating 250 malicious documents is trivial for a motivated attacker, especially compared to generating a percentage of a massive dataset.

这项研究表明，数据投毒攻击通过向训练数据中注入恶意样本来破坏模型行为的安全攻击手段，可能导致模型学习到后门或产生有害输出。可能比社区先前假设的更加可行。对于一个有动机的攻击者来说，创建250个恶意文档是微不足道的，特别是与生成海量数据集的某个百分比相比。

Critical Open Questions:

Scaling Laws: Does this constant-number trend hold for models beyond 13B, such as frontier models with hundreds of billions of parameters?

缩放定律： 这种固定数量趋势是否适用于130亿参数以上的模型，例如具有数千亿参数的前沿模型？
Behavior Complexity: Will the same dynamics apply for more harmful backdoors (e.g., generating vulnerable code, bypassing safety filters), which prior work suggests are harder to achieve than a DoS attack?

行为复杂性： 同样的动态过程是否适用于更有害的后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。（例如，生成易受攻击的代码、绕过安全过滤器）？先前的研究表明这些比DoS攻击更难实现。
Defense Scalability: How can we develop defenses that remain effective even against a constant, small number of poisoned samples within an exponentially growing clean dataset?

防御可扩展性： 我们如何开发防御措施，使其即使在指数级增长的干净数据集中面对固定且少量的投毒样本时，仍然有效？

A Defense-Favored Disclosure

We acknowledge the risk that publicizing these findings could inspire malicious actors. However, we believe this work is ultimately defense-favored. Poisoning is a "pre-commitment" attack: the attacker must inject their poisoned data before training, allowing defenders to proactively inspect datasets and trained models for such vulnerabilities. Raising awareness of this practical threat is crucial to motivate the development of robust defenses, such as improved data provenance, poisoning detection algorithms, and backdoor removal techniques.

我们承认公开这些发现可能会激发恶意行为者的风险。然而，我们相信这项工作最终是有利于防御方的。投毒是一种“预先承诺”式攻击：攻击者必须在训练之前注入他们的投毒数据，这使得防御者可以主动检查数据集和训练后的模型是否存在此类漏洞。提高对这种现实威胁的认识，对于激励开发强大的防御措施至关重要，例如改进数据溯源、投毒检测算法和后门植入模型中的隐藏触发机制，当输入包含特定短语（如<SUDO>）时，模型会执行预设的恶意行为，而正常输入下表现正常。移除技术。

Attackers still face significant hurdles, including gaining reliable access to training data pipelines and designing attacks that survive post-training defenses. By highlighting that the barrier to a successful poisoning attack may be lower than expected, we aim to spur the research community and industry practitioners to prioritize defenses that are effective at scale.

攻击者仍然面临重大障碍，包括可靠地访问训练数据管道，以及设计能够经受住训练后防御的攻击。通过强调成功投毒攻击的门槛可能比预期更低，我们的目标是激励研究界和行业从业者优先考虑可扩展的有效防御措施。

For complete methodological details, additional experiments on poison ordering and fine-tuning vulnerabilities, and in-depth analysis, please read the full paper.

有关完整的方法细节、关于投毒顺序和微调漏洞的额外实验以及深入分析，请阅读完整论文。

Acknowledgments
This research was authored by Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, and Robert Kirk from the UK AI Security Institute, Anthropic, The Alan Turing Institute, University of Oxford, and ETH Zurich.

致谢
这项研究由来自英国人工智能安全研究所、Anthropic、艾伦·图灵研究所、牛津大学和苏黎世联邦理工学院的Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal和Robert Kirk共同完成。