
Why do large AI models struggle with math and anagrams? How does tokenization affect their performance?

2026/4/17
AI Summary (BLUF)

Generative AI models process text through tokenization, breaking it into tokens (words, syllables, or characters) to fit transformer architectures. This method introduces biases, especially in non-English languages, affecting performance and cost. Tokenization also explains models' struggles with math and anagrams. Emerging byte-level models like MambaByte may offer solutions by eliminating tokenization.

Generative AI models don't process text the same way humans do. Understanding their "token"-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

How Transformers See the World: The Tokenization Process

Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can’t take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today’s transformer models work with text that’s been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like “fantastic.” Or they can be syllables, like “fan,” “tas” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).

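To make those granularity options concrete, here is a toy sketch (the splits are illustrative, not any real tokenizer's output) of the same word at the three granularities described above:

```python
# The same word at three granularities a tokenizer might choose.
# Splits are invented for illustration, not taken from a real tokenizer.
word = "fantastic"

word_tokens = [word]                     # word-level: 1 token
syllable_tokens = ["fan", "tas", "tic"]  # subword-level: 3 tokens
char_tokens = list(word)                 # character-level: 9 tokens

# Every granularity reconstructs the same text; they differ only in length.
assert "".join(syllable_tokens) == word
assert "".join(char_tokens) == word
print(len(word_tokens), len(syllable_tokens), len(char_tokens))  # 1 3 9
```

Coarser tokens pack more meaning into each sequence position, which is exactly what matters for the context-window limit discussed next.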
The Double-Edged Sword: Benefits and Biases of Tokenization

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

The Problem of Inconsistency

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” “ ” (the lone space as its own token). Depending on how a model is prompted — with “once upon a” or with “once upon a ” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

Tokenizers treat case differently, too. “Hello” isn’t necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “LL,” and “O”). That’s why many transformers fail the capital letter test.

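Both failure modes (the trailing-space token and the uppercase blow-up) can be illustrated with a toy greedy longest-match tokenizer over a made-up vocabulary. Real tokenizers such as BPE differ in detail, but the effect is the same:

```python
# A minimal greedy longest-match tokenizer over a hypothetical vocabulary.
# The vocabulary is invented to mirror the behavior described above.
VOCAB = {"once", " upon", " a", " time", " ",
         "hello", "H", "E", "L", "O"}

def tokenize(text: str) -> list[str]:
    """Repeatedly consume the longest vocabulary entry that prefixes text."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in VOCAB:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            raise ValueError(f"cannot tokenize {text!r}")
    return tokens

print(tokenize("once upon a"))   # ['once', ' upon', ' a']
print(tokenize("once upon a "))  # ['once', ' upon', ' a', ' ']  (extra token)
print(tokenize("hello"))         # ['hello']  (1 token)
print(tokenize("HELLO"))         # ['H', 'E', 'L', 'L', 'O']  (5 tokens)
```

In this toy vocabulary “HELLO” falls back to single characters; a real tokenizer might merge some of them, but it still produces a different, longer sequence than lowercase “hello,” which is the point.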
“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

The Global Challenge: Tokenization Beyond English

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t — nor do Korean, Thai or Khmer.

The Performance and Cost Disparity

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.

The following table summarizes the tokenization challenges and performance impact across different language types:

| Language type | Tokenization challenge | Performance / cost impact |
| --- | --- | --- |
| Logographic (e.g., Chinese) | Each character is often treated as its own token, producing very high token counts. | The same semantic content can require several times (up to ~10x) more tokens than English, raising cost and consuming the context window faster. |
| Agglutinative (e.g., Turkish) | Words are built from many morphemes, and tokenizers tend to turn each morpheme into a token. | Token counts grow sharply, hurting processing efficiency and effective context length. |
| Space-free (e.g., Thai) | No explicit word delimiters, so segmentation is difficult and inconsistent. | Thai “hello” (สวัสดี) may be split into 6 tokens, while English “hello” is usually 1 token. |
| English | Space-delimited and relatively regular; the design baseline for most tokenizers. | High token efficiency; typically serves as the baseline for performance comparisons. |

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning in English.

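Combining the findings above (per-token billing plus up to 10x more tokens for the same content), the cost gap follows directly. A sketch with an assumed, purely illustrative price:

```python
# Hypothetical per-token pricing; the rate below is invented for illustration.
PRICE_PER_1K_TOKENS = 0.01  # assumed USD per 1,000 tokens

def usage_cost(num_tokens: int) -> float:
    """Cost of a request when the vendor bills per token."""
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS

english_tokens = 1_000       # a message in a token-efficient language
inefficient_tokens = 10_000  # the same meaning at 10x the token count

print(usage_cost(english_tokens))      # 0.01
print(usage_cost(inefficient_tokens))  # 0.1
```

The same 10x multiplier also fills the context window ten times faster, so the user gets less effective history per request on top of the higher bill.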
The Root of Numerical and Logical Struggles

Beyond language inequities, tokenization might explain why today’s models are bad at math.

Rarely are digits tokenized consistently. Because they don’t really know what numbers are, tokenizers might treat “380” as one token, but represent “381” as a pair (“38” and “1”) — effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926).

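The “380” vs. “381” split can be mimicked with a greedy longest-match pass over a made-up digit vocabulary; real vocabularies cover number strings just as unevenly:

```python
# Hypothetical vocabulary that happens to contain "380" but not "381",
# mirroring how subword vocabularies cover number strings unevenly.
NUM_VOCAB = {"380", "38", "0", "1", "2", "3", "8"}

def tokenize_number(s: str) -> list[str]:
    """Greedy longest-match tokenization of a digit string."""
    tokens = []
    while s:
        for end in range(len(s), 0, -1):
            if s[:end] in NUM_VOCAB:
                tokens.append(s[:end])
                s = s[end:]
                break
    return tokens

print(tokenize_number("380"))  # ['380']      (one token)
print(tokenize_number("381"))  # ['38', '1']  (two tokens)
```

Two adjacent integers thus get structurally different representations, so place-value relationships that are obvious to a human are invisible at the model's input layer.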
That’s also the reason models aren’t great at solving anagram problems or reversing words.

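Reversal shows the same mismatch: the model operates on the token sequence, not the character sequence, and reversing one is not the same as reversing the other. A sketch with an assumed subword split:

```python
# Assumed subword split of "fantastic"; real tokenizers may split differently.
tokens = ["fan", "tas", "tic"]
word = "".join(tokens)  # "fantastic"

token_level = "".join(reversed(tokens))  # reversing the units the model sees
char_level = word[::-1]                  # the reversal a human intends

print(token_level)  # tictasfan
print(char_level)   # citsatnaf
```

A model that can only rearrange whole tokens cannot express “citsatnaf” unless its vocabulary happens to contain the right pieces.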
Pathways Forward: Beyond Tokenization?

So, tokenization clearly presents challenges for generative AI. Can they be solved? Maybe.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.

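What “working directly with raw bytes” looks like in practice: the input is just the UTF-8 byte sequence, with no vocabulary at all, at the price of much longer sequences for non-Latin scripts:

```python
# UTF-8 byte sequences as a byte-level model would consume them.
english = "hello"
thai = "สวัสดี"  # the Thai greeting, 6 characters

english_bytes = list(english.encode("utf-8"))
thai_bytes = list(thai.encode("utf-8"))

print(len(english), len(english_bytes))  # 5 characters -> 5 bytes
print(len(thai), len(thai_bytes))        # 6 characters -> 18 bytes (3 each)
```

Every script is handled by the same 256-value byte alphabet, which removes tokenizer bias, but the Thai greeting now occupies 18 sequence positions instead of the handful of tokens a tokenizer would emit.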
Models like MambaByte are in the early research stages, however.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations.”

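The quadratic scaling Feucht describes is easy to quantify: self-attention compares every position with every other, so cost grows with the square of sequence length. An illustrative comparison, assuming roughly 4 characters per token:

```python
# Pairwise attention comparisons for the same text at two granularities.
def attention_pairs(seq_len: int) -> int:
    """Each of seq_len positions attends to all seq_len positions."""
    return seq_len * seq_len

tokens = 1_000  # a ~4,000-character document as ~1,000 tokens (assumed ratio)
chars = 4_000   # the same document fed character by character

print(attention_pairs(tokens))  # 1000000
print(attention_pairs(chars))   # 16000000  (4x the length -> 16x the work)
```

This is why character-level input is, as Feucht says, computationally infeasible for transformers today, and why sub-quadratic architectures such as state space models can afford it.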
Barring a tokenization breakthrough, it seems new model architectures will be the key.

FAQ

What is tokenization in generative AI, and why is it needed?

Tokenization is the process of breaking text into tokens (such as words, syllables, or characters) so that transformer architectures can process it. It exists because of technical constraints: the models cannot handle raw text directly.

How does tokenization affect non-English languages?

Tokenization methods were designed mainly around English and assume that spaces separate words. For languages without spaces, such as Chinese, this introduces bias, hurting model performance, raising costs, and producing inconsistent handling.

Can tokenization's problems be solved? What are the directions for improvement?

Tokenization underlies models' difficulties with math and anagrams, as well as language bias. Emerging byte-level models such as MambaByte, which eliminate tokenization entirely, may offer a fairer and more efficient path forward.
