
如何提升LLM代理推理效率?PLENA硬件系统实现吞吐量2.23倍提升(2026年) | How to Improve LLM Agentic Inference Efficiency? PLENA Achieves up to 2.23x Higher Throughput (2026)

2026/4/25

AI Summary (BLUF)

PLENA is a hardware-software co-designed system for LLM agentic inference that addresses bandwidth and capacity memory walls. It features a flattened systolic-array architecture, asymmetric quantization, and FlashAttention support, achieving up to 2.23x and 4.70x throughput improvements over A100 GPU and TPU v6e, respectively, and 4.04x better energy efficiency than A100.

原文翻译: PLENA是一个硬件-软件协同设计的系统,针对LLM代理推理,解决带宽和容量内存墙问题。它采用扁平化脉动阵列架构、非对称量化和FlashAttention支持,相比A100 GPU和TPU v6e,吞吐量分别提升2.23倍和4.70倍,能效比A100提升4.04倍。

1 引言 | Introduction

Large Language Models (LLMs) are evolving beyond simple chatbots. They now serve as the backbone of AI agents capable of using tools, executing commands, interacting with web browsers, and controlling computer interfaces. These "agentic" workloads impose fundamentally different demands compared to traditional conversational AI.

大型语言模型(LLM)正在从简单的聊天机器人进化。它们现在构成了AI代理的核心,这些代理能够使用工具、执行命令、与网页浏览器交互以及控制计算机界面。与传统的对话式人工智能相比,这些“代理式”工作负载在本质上有不同的需求。

Agentic inference tasks involve much longer context lengths to capture complex and prolonged inputs—such as an entire webpage DOM or a series of tool-call trajectories. This extended context creates significant off-chip memory traffic, causing the workload to be constrained by two memory walls: the bandwidth wall and the capacity wall.

代理式推理任务涉及长得多的上下文,以捕获复杂且持续的输入——例如整个网页的DOM结构或一系列工具调用轨迹。这种扩展的上下文产生了大量的片外内存流量,导致工作负载受到两个内存墙的限制:带宽墙与容量墙。

2 核心挑战:LLM代理的内存墙 | The Core Challenge: Memory Walls in LLM Agents

To understand why PLENA is needed, we must first examine the unique memory challenges posed by agentic LLM inference.

要理解为何需要PLENA,我们必须首先审视代理式LLM推理带来的独特内存挑战。

2.1 带宽墙 | The Bandwidth Wall

In conventional LLM inference, the workload is often compute-bound. However, agentic workloads require repeatedly loading large amounts of context data (e.g., DOM trees, tool-call histories) from off-chip memory. This demand saturates the available memory bandwidth, forcing compute units to idle while waiting for data.

在传统的LLM推理中,工作负载通常受算力限制(计算密集型)。然而,代理式工作负载需要反复从片外内存加载大量上下文数据(例如,DOM树、工具调用历史)。这一需求耗尽了可用的内存带宽,迫使计算单元在等待数据时处于空闲状态。
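The bandwidth wall can be made concrete with a back-of-the-envelope calculation: during autoregressive decode, each generated token must stream the model weights plus the entire KV cache from off-chip memory. A minimal sketch, where all model and hardware numbers (7B FP16 weights, a 16 GB KV cache, ~2 TB/s of HBM bandwidth) are illustrative assumptions rather than figures from the paper:

```python
# Back-of-the-envelope estimate of the bandwidth-bound decode ceiling.
# All numbers below are illustrative assumptions, not figures from the paper.

def decode_tokens_per_second(weight_bytes, kv_cache_bytes, mem_bw_bytes_per_s):
    """During autoregressive decode, every generated token must stream the
    model weights plus the entire KV cache from off-chip memory, so the
    achievable token rate is capped by memory bandwidth, not by FLOPs."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return mem_bw_bytes_per_s / bytes_per_token

# Illustrative: ~14 GB of FP16 weights (7B params), a 16 GB KV cache from a
# long agentic context, and ~2 TB/s of HBM bandwidth (A100-class):
ceiling = decode_tokens_per_second(14e9, 16e9, 2e12)  # roughly 67 tokens/s
```

No matter how many multipliers the chip has, a single request cannot decode faster than this ceiling, which is why agentic inference leaves compute units idle.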

2.2 容量墙 | The Capacity Wall

Agentic contexts can grow to tens of thousands of tokens—well beyond the typical on-chip SRAM capacity. This forces the system to rely on slower, high-capacity off-chip DRAM, creating a second bottleneck. The combined effect of these two walls prevents compute units from achieving high utilization.

代理式上下文可能增长到数万个词元——这远远超出了典型的片上SRAM容量。这迫使系统依赖速度较慢但容量更大的片外DRAM,从而形成了第二个瓶颈。这两个内存墙的联合作用阻止了计算单元实现高利用率。
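To see why long contexts overflow on-chip SRAM, consider the KV-cache footprint. A quick sketch, assuming an illustrative LLaMA-7B-like model shape (not a configuration taken from the paper):

```python
# KV-cache footprint for a long agentic context. The model shape below is
# an illustrative LLaMA-7B-like configuration, not taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# 32 layers, 32 KV heads of dim 128, a 32k-token context, FP16 elements:
size = kv_cache_bytes(32, 32, 128, 32_768)
gib = size / 2**30  # 16.0 GiB -- far beyond on-chip SRAM (typically tens of MB)
```

A cache measured in gigabytes has no choice but to live in off-chip DRAM, which is exactly the capacity wall the text describes.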

3 PLENA架构概览 | PLENA Architecture Overview

PLENA (Parallel LLM inference Engine with Novel Architecture) is a hardware-software co-designed system built around three core optimization pathways. The system is designed from the ground up to address the specific memory and compute patterns of agentic LLM inference.

PLENA(具备新型架构的并行LLM推理引擎)是一个围绕三条核心优化路径构建的软硬件协同设计系统。该系统是从零开始设计的,旨在解决代理式LLM推理的特定内存和计算模式。

Optimization Pathway | Description (EN) | 描述 (CN)
Pathway 1 - Architecture | A novel flattened systolic-array architecture | 新颖的扁平脉动阵列架构
Pathway 2 - Quantization | Efficient compute and memory units supporting an asymmetric quantization scheme | 支持非对称量化方案的高效计算与内存单元
Pathway 3 - Attention | Native hardware support for FlashAttention | FlashAttention的原生硬件支持

4 三条核心优化路径详解 | Detailed Analysis of the Three Pathways

4.1 路径一:扁平脉动阵列架构 | Pathway 1: Flattened Systolic-Array Architecture

Traditional systolic arrays are designed for regular, dense matrix operations. However, agentic inference involves irregular memory access patterns due to varying context lengths and attention sparsity. PLENA introduces a flattened systolic-array that reconfigures the compute fabric to minimize data movement, allowing for better utilization of memory bandwidth.

传统的脉动阵列是为规则的、密集的矩阵运算设计的。然而,由于上下文长度的变化和注意力的稀疏性,代理式推理涉及不规则的存储器访问模式。PLENA引入了一种扁平脉动阵列,它重构了计算结构以最小化数据移动,从而更好地利用内存带宽。
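For intuition about the baseline PLENA departs from, a classic output-stationary systolic-array dataflow can be simulated cycle by cycle in a few lines. This is the generic textbook array, not the PLENA flattened design, whose specifics are described in the paper and RTL:

```python
# Toy cycle-by-cycle simulation of an output-stationary systolic array
# computing C = A @ B. Generic textbook dataflow for illustration only --
# NOT the PLENA flattened-array design.

def systolic_matmul(A, B):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Operand streams are skewed so that a[i][k] and b[k][j] meet at
    # PE (i, j) on cycle i + j + k; each PE multiply-accumulates into
    # its stationary output element.
    for cycle in range(M + N + K - 2 + 1):
        for i in range(M):
            for j in range(N):
                k = cycle - i - j
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The rigid i + j + k schedule is what makes such arrays efficient on dense, regular matrices and poorly matched to the irregular access patterns of agentic inference.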

4.2 路径二:非对称量化方案 | Pathway 2: Asymmetric Quantization Scheme

PLENA leverages asymmetric quantization to balance precision and memory footprint. Unlike symmetric methods, asymmetric quantization can better represent non-uniform distributions of activations and weights, reducing quantization error while taking advantage of lower-bit arithmetic.

PLENA利用非对称量化来平衡精度和内存占用。与对称方法不同,非对称量化能更好地表示激活值和权重的非均匀分布,在利用低位算术的同时减少量化误差。
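The generic asymmetric (zero-point) scheme the text describes can be sketched as follows; PLENA's exact bit-widths and number format may differ:

```python
# Minimal sketch of asymmetric (zero-point) INT8 quantization -- the
# generic scheme described in the text; PLENA's actual format may differ.

def quantize_asymmetric(x, n_bits=8):
    qmin, qmax = 0, 2 ** n_bits - 1
    lo, hi = min(x), max(x)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a constant input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

# Skewed activations: the zero-point shifts the integer grid so the full
# [0, 255] range covers [lo, hi] instead of a symmetric [-m, m] interval.
x = [0.1, 0.5, 2.0, 3.9]
q, s, z = quantize_asymmetric(x)
x_hat = dequantize(q, s, z)
```

Because the grid adapts to the actual value range, the round-trip error stays within one quantization step even for one-sided distributions, which is where symmetric schemes waste half their range.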

Quantization Comparison:

Method (EN) 方法 (CN) | Precision (EN) 精度 (CN) | Memory Footprint (EN) 内存占用 (CN)
Symmetric 对称量化 | Moderate (symmetric error) 中等(对称误差) | Low 低
Asymmetric (PLENA) 非对称量化(PLENA) | Higher (adaptive range) 更高(自适应范围) | Optimized 优化
FP16 Baseline FP16基线 | Full 全精度 | High 高

4.3 路径三:FlashAttention原生支持 | Pathway 3: Native FlashAttention Support

FlashAttention reduces memory traffic by tiling the attention computation, avoiding the writing of large attention matrices to slower DRAM. PLENA provides native hardware support for this optimization, implementing dedicated on-chip buffers and dataflow paths that streamline the tiling process directly in hardware.

FlashAttention通过分块计算注意力来减少内存流量,避免将大型注意力矩阵写入较慢的DRAM。PLENA为这一优化提供了原生硬件支持,实现了专用的片上缓冲区和数据流路径,直接在硬件中简化分块处理过程。
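The tiling idea can be illustrated with the online-softmax recurrence that FlashAttention builds on, here for a single query row in plain Python; PLENA implements the equivalent dataflow in dedicated on-chip hardware:

```python
import math

# Sketch of the online-softmax tiling at the heart of FlashAttention:
# attention for one query is computed over key/value tiles without ever
# materializing the full score row in memory. Illustrative only.

def tiled_attention_row(q, K, V, tile=2):
    d = len(q)
    m = float("-inf")        # running maximum of the scores seen so far
    l = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running weighted sum of value vectors
    for start in range(0, len(K), tile):
        for k_vec, v_vec in zip(K[start:start + tile], V[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k_vec)) / math.sqrt(d)
            m_new = max(m, s)
            # Rescale previous partial results when the running max changes.
            correction = math.exp(m - m_new) if m != float("-inf") else 0.0
            p = math.exp(s - m_new)
            l = l * correction + p
            acc = [a * correction + p * v for a, v in zip(acc, v_vec)]
            m = m_new
    return [a / l for a in acc]
```

Only one tile of keys and values is live at a time, so the working set fits in fast on-chip buffers regardless of context length.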

5 完整的软硬件栈 | Complete Software-Hardware Stack

Beyond the hardware innovations, PLENA includes a complete software-hardware stack designed for practical deployment and research.

除了硬件创新,PLENA还包含一个为实际部署和研究设计的完整软硬件栈。

Component (EN) 组件 (CN) | Function (EN) 功能 (CN)
Custom ISA 自定义指令集架构 | Defines low-level operations for the PLENA accelerator 为PLENA加速器定义底层操作
Compiler 编译器 | Maps LLM computation graphs to PLENA ISA instructions 将LLM计算图映射到PLENA ISA指令
Transaction-Level Simulator 事务级模拟器 | Enables cycle-accurate performance modeling 实现周期精确的性能建模
Automated DSE 自动化设计空间探索 | Optimizes hardware configurations for target workloads 为目标工作负载优化硬件配置
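As a purely hypothetical illustration of the compiler's role, a lowering pass might map graph-level operators to accelerator instructions as below. Every instruction name here (LOAD_TILE, MATMUL, FLASH_ATTN, QUANT_ASYM, STORE_TILE) is invented for this sketch and is not PLENA's actual ISA:

```python
# Hypothetical sketch of a compiler lowering pass: graph-level ops are
# expanded into a flat accelerator instruction stream. All instruction
# names are invented for illustration; they are NOT PLENA's real ISA.

OP_TO_ISA = {
    "matmul":    ["LOAD_TILE", "MATMUL", "STORE_TILE"],
    "attention": ["LOAD_TILE", "FLASH_ATTN", "STORE_TILE"],
    "quantize":  ["QUANT_ASYM"],
}

def lower(graph):
    """Lower a linear list of graph ops into one instruction stream."""
    program = []
    for op in graph:
        program.extend(OP_TO_ISA[op])
    return program
```

A real compiler would additionally schedule tiles, allocate on-chip buffers, and exploit the DSE results, but the op-to-instruction mapping above captures its basic contract.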

6 实验评估 | Experimental Evaluation

The researchers conducted extensive evaluations using the LLaMA model for agentic inference tasks. The results demonstrate significant performance and energy efficiency gains over state-of-the-art commercial hardware.

研究人员使用LLaMA模型在代理式推理任务上进行了广泛的评估。结果表明,与最先进的商业硬件相比,性能显著提升,能效也大幅优化。

6.1 性能对比 | Performance Comparison

When compared under identical multiplier counts and memory configurations, PLENA delivers substantial throughput improvements:

在相同的乘法器数量和内存配置下进行比较时,PLENA展现出显著的吞吐量提升:

Metric (EN) 指标 (CN) | vs. A100 GPU | vs. TPU v6e
Throughput 吞吐量 | Up to 2.23x | Up to 4.70x
Energy Efficiency 能效 | Up to 4.04x | N/A

Experimental results show that PLENA delivers up to 2.23x and 4.70x higher throughput than the A100 GPU and TPU v6e, respectively, under identical multiplier counts and memory configurations during LLaMA agentic inference. PLENA also achieves up to 4.04x higher energy efficiency than the A100 GPU.

实验结果表明,在LLaMA代理式推理过程中,在相同的乘法器数量和内存配置下,PLENA的吞吐量分别比A100 GPU和TPU v6e高出至多2.23倍和4.70倍。同时,PLENA的能效比A100 GPU高出至多4.04倍。

7 开放资源与总结 | Open Source and Conclusion

PLENA represents a significant step forward in specialized hardware for agentic AI workloads. By holistically addressing the bandwidth and capacity walls through a co-designed approach, it achieves substantial efficiency improvements over general-purpose accelerators.

PLENA代表着在为代理式AI工作负载设计的专用硬件方面迈出的重要一步。通过协同设计的方法全面应对带宽墙和容量墙,其效率相比通用加速器实现了显著提升。

The full PLENA system—including its simulator, compiler, ISA, and RTL implementation—will be open-sourced to the research community, enabling further innovation and validation.

完整的PLENA系统——包括其模拟器、编译器、指令集架构(ISA)和寄存器传输级(RTL)实现——将向研究社区开源,以促进进一步的创新和验证。

This openness underscores the authors' commitment to advancing the field, allowing researchers and engineers to build upon their work and explore new optimizations for the rapidly evolving landscape of agentic AI.

这一开放性彰显了作者推动该领域发展的承诺,使研究人员和工程师能够基于他们的工作进行构建,并为快速发展的代理式AI领域探索新的优化方案。


Paper Reference: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, et al. PLENA: A Hardware-Software Co-Designed System for Agentic LLM Inference. arXiv:2509.09505, 2025.

常见问题 | FAQ

PLENA是什么?相比A100和TPU性能提升多少? | What is PLENA, and how much faster is it than the A100 and TPU?

PLENA is a hardware-software co-designed system for LLM agentic inference that tackles the bandwidth and capacity memory walls. It delivers up to 2.23x higher throughput than the A100 GPU, up to 4.70x higher than the TPU v6e, and up to 4.04x better energy efficiency than the A100.

PLENA是软硬件协同设计的LLM代理推理系统,解决带宽和容量内存墙问题。吞吐量比A100 GPU提升至多2.23倍,比TPU v6e提升至多4.70倍,能效比A100提升至多4.04倍。

LLM代理推理的内存墙有哪些?PLENA如何应对? | What memory walls does LLM agentic inference face, and how does PLENA address them?

The bandwidth wall (loading large volumes of context data saturates memory bandwidth) and the capacity wall (long contexts exceed on-chip SRAM, forcing reliance on slower DRAM). PLENA mitigates both through its flattened systolic array, asymmetric quantization, and native FlashAttention hardware support.

内存墙包括带宽墙(大量上下文数据加载导致带宽饱和)和容量墙(长上下文超出片上SRAM,依赖慢速DRAM)。PLENA通过扁平脉动阵列、非对称量化和FlashAttention硬件支持来缓解这两类瓶颈。

PLENA的三条核心优化路径是什么? | What are PLENA's three core optimization pathways?

1) A flattened systolic-array architecture that reduces data movement; 2) an asymmetric quantization scheme that balances precision and memory footprint; 3) native hardware support for FlashAttention that optimizes attention computation.

三条路径:1)扁平脉动阵列架构,减少数据移动;2)非对称量化方案,平衡精度与内存占用;3)原生硬件支持FlashAttention,优化注意力计算。
