How to Optimize LLM Context Windows: Sakana AI's Universal Transformer Memory Explained

Researchers at Tokyo-based startup Sakana AI have developed a new technique that lets language models use memory more efficiently, helping enterprises cut the cost of building applications on top of large language models (LLMs) and other Transformer-based models. The technique, called "universal transformer memory," uses special neural networks to optimize LLMs so that they keep the pieces of information that matter and discard redundant details from their context.

Optimizing Transformer memory

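The responses of Transformer models, the backbone of LLMs, depend on the content of their "context window," that is, the input they receive from users. The context window can be thought of as the model's working memory. Tweaking its content can have a tremendous impact on the model's performance, which has given rise to an entire field of "prompt engineering."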

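Current models support very long context windows that hold hundreds of thousands or even millions of tokens (an LLM's numerical representations of the words, word parts, phrases, concepts and numbers that users enter in their prompts). This lets users cram more information into their prompts. However, longer prompts can result in higher compute costs and slower performance. Optimizing prompts to remove unnecessary tokens while keeping important information can reduce costs and increase speed.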

Current prompt optimization techniques are either resource-intensive or require users to manually test different configurations to reduce the size of their prompts.

Neural attention memory models

Universal transformer memory optimizes prompts using neural attention memory models (NAMMs), simple neural networks that decide whether to "remember" or "forget" each given token stored in the LLM's memory.

"This new capability allows Transformers to discard unhelpful or redundant details, and focus on the most critical information, something we find to be crucial for tasks requiring long-context reasoning," the researchers write.

NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the model's inner activations, which means they can only be applied to open-source models.

Like other techniques developed by Sakana AI, NAMMs are trained through evolutionary algorithms instead of gradient-based optimization methods. By iteratively mutating candidate models and selecting the best-performing ones through trial and error, evolutionary algorithms optimize NAMMs for efficiency and performance. This is especially important because NAMMs are trying to achieve a non-differentiable goal: keeping or discarding tokens.
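To see why a keep-or-discard objective suits evolutionary search, consider the minimal sketch below. It is a toy illustration, not Sakana AI's training code: the token data, the single-threshold policy, the reward values and the names `TOKENS`, `fitness` and `evolve` are all assumptions made for this example. The point is only that the fitness function is a step function of its parameter, so a gradient would be zero almost everywhere, while mutate-and-select still makes progress.

```python
import random

# Hypothetical toy data: each token has one feature value and a flag saying
# whether the downstream task actually needs it.
TOKENS = [(0.9, True), (0.8, True), (0.7, True), (0.4, False),
          (0.3, False), (0.2, False), (0.6, True), (0.1, False)]


def fitness(threshold: float) -> float:
    """Score a keep/discard policy. The hard comparison makes this a step
    function of `threshold`, so its gradient is zero almost everywhere."""
    score = 0.0
    for feature, needed in TOKENS:
        kept = feature > threshold      # discrete decision: keep or discard
        if kept and needed:
            score += 1.0                # reward keeping a useful token
        elif kept and not needed:
            score -= 0.5                # penalise wasted cache space
        elif not kept and needed:
            score -= 1.0                # penalise losing information
    return score


def evolve(generations: int = 30, population: int = 16,
           sigma: float = 0.2, seed: int = 0) -> float:
    """Mutate the current best threshold; keep whichever candidate scores highest."""
    rng = random.Random(seed)
    best = 0.0                          # initial policy keeps every token
    for _ in range(generations):
        # The current best stays in the pool, so fitness never decreases.
        candidates = [best] + [best + rng.gauss(0.0, sigma)
                               for _ in range(population)]
        best = max(candidates, key=fitness)
    return best


best_threshold = evolve()
print(best_threshold, fitness(best_threshold))
```

Because the current best candidate is always carried into the next generation, the search never gets worse, and it needs nothing from the objective except the ability to evaluate it.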

NAMMs operate on the attention layers of LLMs, one of the key components of the Transformer architecture, which determine the relations and importance of each token in the model's context window. Based on attention values, NAMMs determine which tokens should be preserved and which can be discarded from the LLM's context window. This attention-based mechanism makes it possible to use a trained NAMM on various models without further modification. For example, a NAMM trained on text-only data can be applied to vision or multi-modal models without additional training.
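The idea of choosing tokens from attention values can be sketched roughly as follows. This is a simplified stand-in, not the NAMM itself: a real NAMM is a trained neural network, whereas this example substitutes a fixed rule (average attention received per token). The function names, the `keep_ratio` parameter and the attention matrix are all hypothetical.

```python
def attention_received(attn: list[list[float]]) -> list[float]:
    """Average attention each key token receives across all query rows."""
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


def prune_cache(tokens: list[str], attn: list[list[float]],
                keep_ratio: float) -> list[str]:
    """Keep the top `keep_ratio` share of tokens by attention received,
    returned in their original order."""
    scores = attention_received(attn)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])      # restore sequence order
    return [tokens[i] for i in keep]


tokens = ["def", "add", "#", "comment", "return", "a+b"]
# Hypothetical attention weights: 2 query rows x 6 key tokens, each row sums to 1.
attn = [
    [0.30, 0.25, 0.02, 0.03, 0.20, 0.20],
    [0.20, 0.30, 0.03, 0.02, 0.25, 0.20],
]
print(prune_cache(tokens, attn, keep_ratio=0.5))
```

Since the rule only looks at attention weights, not at what the tokens represent, the same function would work unchanged on attention matrices from a different model or modality, which is the property the paragraph above describes.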

Universal memory in action

To test the universal transformer memory concept in action, the researchers trained a NAMM on top of an open-source Meta Llama 3-8B model. Their experiments show that with NAMMs, Transformer-based models perform better on natural language and coding problems on very long sequences. Meanwhile, by discarding unnecessary tokens, NAMMs enabled the LLM to save up to 75% of its cache memory while performing the tasks.
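To get a feel for what a 75% cache saving could mean in absolute terms, the back-of-the-envelope calculation below estimates key-value cache size for a long prompt. The layer count, number of key-value heads, head dimension, 16-bit storage and 100,000-token prompt are assumed values chosen to resemble an 8B-parameter model with grouped-query attention. They are not figures from the researchers' experiments, and the sketch assumes the same share of tokens is dropped in every layer.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `n_tokens` tokens."""
    # Leading factor 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed configuration (illustrative only; check the real model config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
PROMPT_TOKENS = 100_000
KEPT_FRACTION = 0.25                    # i.e. 75% of cached tokens discarded

full = kv_cache_bytes(PROMPT_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM)
pruned = kv_cache_bytes(int(PROMPT_TOKENS * KEPT_FRACTION),
                        N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"pruned cache: {pruned / 2**30:.2f} GiB")
```

Cache size grows linearly with the number of retained tokens, so under these assumptions discarding three quarters of them cuts the cache to a quarter of its original size.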

"Across our benchmarks, NAMMs provide clear performance improvements to the Llama 3-8B transformer," the researchers write. "Furthermore, our memory systems yield notable side benefits, reducing the context size of each layer, while never being explicitly optimized for memory efficiency."

They also tested the NAMM on the 70B version of Llama as well as on Transformer models designed for other modalities and tasks, such as Llava (computer vision) and Decision Transformer (reinforcement learning).

"Even in these out-of-distribution settings, NAMMs retain their benefits by discarding tokens such as redundant video frames and suboptimal actions, allowing their new base models to focus on the most relevant information to improve performance," the researchers write.

Task-dependent behavior

Another interesting finding is that NAMMs automatically adjust their behavior based on the task. For coding tasks, for example, the NAMM discards contiguous chunks of tokens that correspond to comments and whitespace that don't affect the code's execution. In natural language tasks, on the other hand, it discards tokens that represent grammatical redundancies and don't affect the meaning of the sequence.

The researchers have released the code for creating your own NAMMs. Techniques such as universal transformer memory can be very useful for enterprise applications that process millions of tokens and can benefit from speed boosts and cost reductions. The reusability of a trained NAMM also makes it a versatile tool to use across different applications in an enterprise.

For the future, the researchers suggest more advanced techniques, such as using NAMMs during the training of LLMs to further extend their memory capabilities.

"This work has only begun to tap into the potential of our new class of memory models, which we anticipate might offer many new opportunities to advance future generations of transformers," the researchers write.