PLENA is a hardware-software co-designed system for agentic LLM inference that targets two memory walls: bandwidth and capacity. It combines a flattened systolic-array architecture, asymmetric quantization, and FlashAttention support, achieving up to 2.23x and 4.70x higher throughput than the A100 GPU and TPU v6e, respectively, and 4.04x better energy efficiency than the A100.