
如何提升LLM代理推理效率?PLENA硬件系统实现吞吐量2.23倍提升(2026年) | How to Improve LLM Agentic Inference Efficiency? PLENA Achieves up to 2.23x Higher Throughput (2026)

2026/4/25

AI Summary (BLUF)

PLENA is a hardware-software co-designed system for LLM agentic inference that addresses bandwidth and capacity memory walls. It features a flattened systolic-array architecture, asymmetric quantization, and FlashAttention support, achieving up to 2.23x and 4.70x throughput improvements over A100 GPU and TPU v6e, respectively, and 4.04x better energy efficiency than A100.

原文翻译: PLENA是一个硬件-软件协同设计的系统,针对LLM代理推理,解决带宽和容量内存墙问题。它采用扁平化脉动阵列架构、非对称量化和FlashAttention支持,相比A100 GPU和TPU v6e,吞吐量分别提升2.23倍和4.70倍,能效比A100提升4.04倍。

1 引言 | Introduction

Large Language Models (LLMs) are evolving beyond simple chatbots. They now serve as the backbone of AI agents capable of using tools, executing commands, interacting with web browsers, and controlling computer interfaces. These "agentic" workloads impose fundamentally different demands compared to traditional conversational AI.

大型语言模型(LLM)正在从简单的聊天机器人进化。它们现在构成了AI代理的核心,这些代理能够使用工具、执行命令、与网页浏览器交互以及控制计算机界面。与传统的对话式人工智能相比,这些“代理式”工作负载在本质上有不同的需求。

Agentic inference tasks involve much longer context lengths to capture complex and prolonged inputs—such as an entire webpage DOM or a series of tool-call trajectories. This extended context creates significant off-chip memory traffic, causing the workload to be constrained by two memory walls: the bandwidth wall and the capacity wall.

代理式推理任务涉及长得多的上下文,以捕获复杂且持续的输入——例如整个网页的DOM结构或一系列工具调用轨迹。这种扩展的上下文产生了大量的片外内存流量,导致工作负载受到两个内存墙的限制:带宽墙与容量墙。

2 核心挑战:LLM代理的内存墙 | The Core Challenge: Memory Walls in LLM Agents

To understand why PLENA is needed, we must first examine the unique memory challenges posed by agentic LLM inference.

要理解为何需要PLENA,我们必须首先审视代理式LLM推理带来的独特内存挑战。

2.1 带宽墙 | The Bandwidth Wall

In conventional LLM inference, the workload is often compute-bound. However, agentic workloads require repeatedly loading large amounts of context data (e.g., DOM trees, tool-call histories) from off-chip memory. This demand saturates the available memory bandwidth, forcing compute units to idle while waiting for data.

在传统的LLM推理中,工作负载通常受算力限制(计算密集型)。然而,代理式工作负载需要反复从片外内存加载大量上下文数据(例如,DOM树、工具调用历史)。这一需求耗尽了可用的内存带宽,迫使计算单元在等待数据时处于空闲状态。
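The bandwidth wall can be made concrete with a back-of-the-envelope calculation: during autoregressive decode, each generated token must stream the model weights plus the entire KV cache from off-chip memory. A minimal sketch, where all model and hardware numbers (7B FP16 weights, a 16 GB KV cache, ~2 TB/s of HBM bandwidth) are illustrative assumptions rather than figures from the paper:

```python
# Back-of-the-envelope estimate of the bandwidth-bound decode ceiling.
# All numbers below are illustrative assumptions, not figures from the paper.

def decode_tokens_per_second(weight_bytes, kv_cache_bytes, mem_bw_bytes_per_s):
    """During autoregressive decode, every generated token must stream the
    model weights plus the entire KV cache from off-chip memory, so the
    achievable token rate is capped by memory bandwidth, not by FLOPs."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return mem_bw_bytes_per_s / bytes_per_token

# Illustrative: ~14 GB of FP16 weights (7B params), a 16 GB KV cache from a
# long agentic context, and ~2 TB/s of HBM bandwidth (A100-class):
ceiling = decode_tokens_per_second(14e9, 16e9, 2e12)  # roughly 67 tokens/s
```

No matter how many multipliers the chip has, a single request cannot decode faster than this ceiling, which is why agentic inference leaves compute units idle.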

2.2 容量墙 | The Capacity Wall

Agentic contexts can grow to tens of thousands of tokens—well beyond the typical on-chip SRAM capacity. This forces the system to rely on slower, high-capacity off-chip DRAM, creating a second bottleneck. The combined effect of these two walls prevents compute units from achieving high utilization.

代理式上下文可能增长到数万个词元——这远远超出了典型的片上SRAM容量。这迫使系统依赖速度较慢但容量更大的片外DRAM,从而形成了第二个瓶颈。这两个内存墙的联合作用阻止了计算单元实现高利用率。
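To see why long contexts overflow on-chip SRAM, consider the KV-cache footprint. A quick sketch, assuming an illustrative LLaMA-7B-like model shape (not a configuration taken from the paper):

```python
# KV-cache footprint for a long agentic context. The model shape below is
# an illustrative LLaMA-7B-like configuration, not taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# 32 layers, 32 KV heads of dim 128, a 32k-token context, FP16 elements:
size = kv_cache_bytes(32, 32, 128, 32_768)
gib = size / 2**30  # 16.0 GiB -- far beyond on-chip SRAM (typically tens of MB)
```

A cache measured in gigabytes has no choice but to live in off-chip DRAM, which is exactly the capacity wall the text describes.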

3 PLENA架构概览 | PLENA Architecture Overview

PLENA (Parallel LLM inference Engine with Novel Architecture) is a hardware-software co-designed system built around three core optimization pathways. The system is designed from the ground up to address the specific memory and compute patterns of agentic LLM inference.

PLENA(具备新型架构的并行LLM推理引擎)是一个围绕三条核心优化路径构建的软硬件协同设计系统。该系统是从零开始设计的,旨在解决代理式LLM推理的特定内存和计算模式。

Optimization Pathway | Description (EN) | 描述 (CN)
Pathway 1 - Architecture | A novel flattened systolic-array architecture | 新颖的扁平脉动阵列架构
Pathway 2 - Quantization | Efficient compute and memory units supporting an asymmetric quantization scheme | 支持非对称量化方案的高效计算与内存单元
Pathway 3 - Attention | Native hardware support for FlashAttention | FlashAttention的原生硬件支持

4 三条核心优化路径详解 | Detailed Analysis of the Three Pathways

4.1 路径一:扁平脉动阵列架构 | Pathway 1: Flattened Systolic-Array Architecture

Traditional systolic arrays are designed for regular, dense matrix operations. However, agentic inference involves irregular memory access patterns due to varying context lengths and attention sparsity. PLENA introduces a flattened systolic-array that reconfigures the compute fabric to minimize data movement, allowing for better utilization of memory bandwidth.

传统的脉动阵列是为规则的、密集的矩阵运算设计的。然而,由于上下文长度的变化和注意力的稀疏性,代理式推理涉及不规则的存储器访问模式。PLENA引入了一种扁平脉动阵列,它重构了计算结构以最小化数据移动,从而更好地利用内存带宽。
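For intuition about the baseline PLENA departs from, a classic output-stationary systolic-array dataflow can be simulated cycle by cycle in a few lines. This is the generic textbook array, not the PLENA flattened design, whose specifics are described in the paper and RTL:

```python
# Toy cycle-by-cycle simulation of an output-stationary systolic array
# computing C = A @ B. Generic textbook dataflow for illustration only --
# NOT the PLENA flattened-array design.

def systolic_matmul(A, B):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Operand streams are skewed so that a[i][k] and b[k][j] meet at
    # PE (i, j) on cycle i + j + k; each PE multiply-accumulates into
    # its stationary output element.
    for cycle in range(M + N + K - 2 + 1):
        for i in range(M):
            for j in range(N):
                k = cycle - i - j
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The rigid i + j + k schedule is what makes such arrays efficient on dense, regular matrices and poorly matched to the irregular access patterns of agentic inference.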

4.2 路径二:非对称量化方案 | Pathway 2: Asymmetric Quantization Scheme

PLENA leverages asymmetric quantization to balance precision and memory footprint. Unlike symmetric methods, asymmetric quantization can better represent non-uniform distributions of activations and weights, reducing quantization error while taking advantage of lower-bit arithmetic.

PLENA利用非对称量化来平衡精度和内存占用。与对称方法不同,非对称量化能更好地表示激活值和权重的非均匀分布,在利用低位算术的同时减少量化误差。
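The generic asymmetric (zero-point) scheme the text describes can be sketched as follows; PLENA's exact bit-widths and number format may differ:

```python
# Minimal sketch of asymmetric (zero-point) INT8 quantization -- the
# generic scheme described in the text; PLENA's actual format may differ.

def quantize_asymmetric(x, n_bits=8):
    qmin, qmax = 0, 2 ** n_bits - 1
    lo, hi = min(x), max(x)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a constant input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

# Skewed activations: the zero-point shifts the integer grid so the full
# [0, 255] range covers [lo, hi] instead of a symmetric [-m, m] interval.
x = [0.1, 0.5, 2.0, 3.9]
q, s, z = quantize_asymmetric(x)
x_hat = dequantize(q, s, z)
```

Because the grid adapts to the actual value range, the round-trip error stays within one quantization step even for one-sided distributions, which is where symmetric schemes waste half their range.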

Quantization Comparison:

Method (EN) 方法 (CN) | Precision (EN) 精度 (CN) | Memory Footprint (EN) 内存占用 (CN)
Symmetric 对称量化 | Moderate (symmetric error) 中等(对称误差) | Low 低
Asymmetric (PLENA) 非对称量化(PLENA) | Higher (adaptive range) 更高(自适应范围) | Optimized 优化
FP16 Baseline FP16基线 | Full 全精度 | High 高

4.3 路径三:FlashAttention原生支持 | Pathway 3: Native FlashAttention Support

FlashAttention reduces memory traffic by tiling the attention computation, avoiding the writing of large attention matrices to slower DRAM. PLENA provides native hardware support for this optimization, implementing dedicated on-chip buffers and dataflow paths that streamline the tiling process directly in hardware.

FlashAttention通过分块计算注意力来减少内存流量,避免将大型注意力矩阵写入较慢的DRAM。PLENA为这一优化提供了原生硬件支持,实现了专用的片上缓冲区和数据流路径,直接在硬件中简化分块处理过程。
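The tiling idea can be illustrated with the online-softmax recurrence that FlashAttention builds on, here for a single query row in plain Python; PLENA implements the equivalent dataflow in dedicated on-chip hardware:

```python
import math

# Sketch of the online-softmax tiling at the heart of FlashAttention:
# attention for one query is computed over key/value tiles without ever
# materializing the full score row in memory. Illustrative only.

def tiled_attention_row(q, K, V, tile=2):
    d = len(q)
    m = float("-inf")        # running maximum of the scores seen so far
    l = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running weighted sum of value vectors
    for start in range(0, len(K), tile):
        for k_vec, v_vec in zip(K[start:start + tile], V[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k_vec)) / math.sqrt(d)
            m_new = max(m, s)
            # Rescale previous partial results when the running max changes.
            correction = math.exp(m - m_new) if m != float("-inf") else 0.0
            p = math.exp(s - m_new)
            l = l * correction + p
            acc = [a * correction + p * v for a, v in zip(acc, v_vec)]
            m = m_new
    return [a / l for a in acc]
```

Only one tile of keys and values is live at a time, so the working set fits in fast on-chip buffers regardless of context length.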

5 完整的软硬件栈 | Complete Software-Hardware Stack

Beyond the hardware innovations, PLENA includes a complete software-hardware stack designed for practical deployment and research.

除了硬件创新,PLENA还包含一个为实际部署和研究设计的完整软硬件栈。

Component (EN) 组件 (CN) | Function (EN) 功能 (CN)
Custom ISA 自定义指令集架构 | Defines low-level operations for the PLENA accelerator 为PLENA加速器定义底层操作
Compiler 编译器 | Maps LLM computation graphs to PLENA ISA instructions 将LLM计算图映射到PLENA ISA指令
Transaction-Level Simulator 事务级模拟器 | Enables cycle-accurate performance modeling 实现周期精确的性能建模
Automated DSE 自动化设计空间探索 | Optimizes hardware configurations for target workloads 为目标工作负载优化硬件配置
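As a purely hypothetical illustration of the compiler's role, a lowering pass might map graph-level operators to accelerator instructions as below. Every instruction name here (LOAD_TILE, MATMUL, FLASH_ATTN, QUANT_ASYM, STORE_TILE) is invented for this sketch and is not PLENA's actual ISA:

```python
# Hypothetical sketch of a compiler lowering pass: graph-level ops are
# expanded into a flat accelerator instruction stream. All instruction
# names are invented for illustration; they are NOT PLENA's real ISA.

OP_TO_ISA = {
    "matmul":    ["LOAD_TILE", "MATMUL", "STORE_TILE"],
    "attention": ["LOAD_TILE", "FLASH_ATTN", "STORE_TILE"],
    "quantize":  ["QUANT_ASYM"],
}

def lower(graph):
    """Lower a linear list of graph ops into one instruction stream."""
    program = []
    for op in graph:
        program.extend(OP_TO_ISA[op])
    return program
```

A real compiler would additionally schedule tiles, allocate on-chip buffers, and exploit the DSE results, but the op-to-instruction mapping above captures its basic contract.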

6 实验评估 | Experimental Evaluation

The researchers conducted extensive evaluations using the LLaMA model for agentic inference tasks. The results demonstrate significant performance and energy efficiency gains over state-of-the-art commercial hardware.

研究人员使用LLaMA模型在代理式推理任务上进行了广泛的评估。结果表明,与最先进的商业硬件相比,性能显著提升,能效也大幅优化。

6.1 性能对比 | Performance Comparison

When compared under identical multiplier counts and memory configurations, PLENA delivers substantial throughput improvements:

在相同的乘法器数量和内存配置下进行比较时,PLENA展现出显著的吞吐量提升:

Metric (EN) 指标 (CN) | vs. A100 GPU | vs. TPU v6e
Throughput 吞吐量 | Up to 2.23x | Up to 4.70x
Energy Efficiency 能效 | Up to 4.04x | N/A

Experimental results show that PLENA delivers up to 2.23x and 4.70x higher throughput than the A100 GPU and TPU v6e, respectively, under identical multiplier counts and memory configurations during LLaMA agentic inference. PLENA also achieves up to 4.04x higher energy efficiency than the A100 GPU.

实验结果表明,在LLaMA代理式推理过程中,在相同的乘法器数量和内存配置下,PLENA的吞吐量分别比A100 GPU和TPU v6e高出至多2.23倍和4.70倍。同时,PLENA的能效比A100 GPU高出至多4.04倍。

7 开放资源与总结 | Open Source and Conclusion

PLENA represents a significant step forward in specialized hardware for agentic AI workloads. By holistically addressing the bandwidth and capacity walls through a co-designed approach, it achieves substantial efficiency improvements over general-purpose accelerators.

PLENA代表着在为代理式AI工作负载设计的专用硬件方面迈出的重要一步。通过协同设计的方法全面应对带宽墙和容量墙,其效率相比通用加速器实现了显著提升。

The full PLENA system—including its simulator, compiler, ISA, and RTL implementation—will be open-sourced to the research community, enabling further innovation and validation.

完整的PLENA系统——包括其模拟器、编译器、指令集架构(ISA)和寄存器传输级(RTL)实现——将向研究社区开源,以促进进一步的创新和验证。

This openness underscores the authors' commitment to advancing the field, allowing researchers and engineers to build upon their work and explore new optimizations for the rapidly evolving landscape of agentic AI.

这一开放性彰显了作者推动该领域发展的承诺,使研究人员和工程师能够基于他们的工作进行构建,并为快速发展的代理式AI领域探索新的优化方案。


Paper Reference: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, et al. PLENA: A Hardware-Software Co-Designed System for Agentic LLM Inference. arXiv:2509.09505, 2025.

常见问题 | FAQ

PLENA是什么?相比A100和TPU性能提升多少? | What is PLENA, and how much faster is it than the A100 and TPU?

PLENA is a hardware-software co-designed system for LLM agentic inference that tackles the bandwidth and capacity memory walls. It delivers up to 2.23x higher throughput than the A100 GPU, up to 4.70x higher than the TPU v6e, and up to 4.04x better energy efficiency than the A100.

PLENA是软硬件协同设计的LLM代理推理系统,解决带宽和容量内存墙问题。吞吐量比A100 GPU提升至多2.23倍,比TPU v6e提升至多4.70倍,能效比A100提升至多4.04倍。

LLM代理推理的内存墙有哪些?PLENA如何应对? | What memory walls does LLM agentic inference face, and how does PLENA address them?

The bandwidth wall (loading large volumes of context data saturates memory bandwidth) and the capacity wall (long contexts exceed on-chip SRAM, forcing reliance on slower DRAM). PLENA mitigates both through its flattened systolic array, asymmetric quantization, and native FlashAttention hardware support.

内存墙包括带宽墙(大量上下文数据加载导致带宽饱和)和容量墙(长上下文超出片上SRAM,依赖慢速DRAM)。PLENA通过扁平脉动阵列、非对称量化和FlashAttention硬件支持来缓解这两类瓶颈。

PLENA的三条核心优化路径是什么? | What are PLENA's three core optimization pathways?

1) A flattened systolic-array architecture that reduces data movement; 2) an asymmetric quantization scheme that balances precision and memory footprint; 3) native hardware support for FlashAttention that optimizes attention computation.

三条路径:1)扁平脉动阵列架构,减少数据移动;2)非对称量化方案,平衡精度与内存占用;3)原生硬件支持FlashAttention,优化注意力计算。
