How Compute-in-Memory (CIM) Solves the AI Energy Crisis: From Theory to Architectural Breakthroughs
Compute-in-memory (CIM) addresses AI's energy crisis by processing data where it is stored, cutting the costly data movement that can dominate AI workloads (up to 80% of execution time in large language models) and helping make zettascale computing sustainable.
Executive Summary
Skyrocketing AI compute workloads and fixed power budgets are forcing chip and system architects to take a much harder look at compute in memory (CIM), which until recently was considered little more than a science project.
The AI Energy Crisis
The Zettascale Challenge
In a keynote address at the recent Hot Chips 2023 conference, Google Chief Scientist Jeff Dean observed that model sizes and the associated computing requirements are increasing by as much as a factor of 10 each year.[1] And while zettascale computing (at least 10^21 operations per second) is within reach, it carries a high price tag.
Case in point: Lisa Su, chair and CEO of AMD, observed that if current trends continue, the first zettascale computer will require 0.5 gigawatts of power, or about half the output of a typical nuclear power plant for a single system.[2] In a world increasingly concerned about energy demand and energy-related carbon emissions, the assumption that data centers can grow indefinitely is no longer valid.
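To put that figure in perspective, dividing the target throughput by the power budget gives the sustained whole-system efficiency such a machine would need (simple arithmetic on the numbers just quoted):

```python
# Back-of-envelope: the system-level efficiency implied by the keynote figures.
ops_per_second = 1e21    # zettascale: at least 10^21 operations per second
power_watts = 0.5e9      # 0.5 GW power budget

ops_per_joule = ops_per_second / power_watts
print(f"{ops_per_joule:.1e} ops/J = {ops_per_joule / 1e12:.0f} TOPS/W sustained")
# -> 2.0e+12 ops/J, i.e. ~2 TOPS/W for the *entire system*, with compute,
#    memory, interconnect, and cooling all inside that budget.
```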
Why CIM Matters
The Memory Wall Problem
CIM solves two problems. First, it takes more energy to move data back and forth between memory and processor than to actually process it. And second, there is so much data being collected through sensors and other sources and parked in memory, that it's faster to pre-process at least some of that data where it is being stored.
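The first point is worth quantifying. The figures below are illustrative estimates in the spirit of Mark Horowitz's widely cited ISSCC 2014 numbers for a ~45 nm process; they are assumptions for this sketch, not measurements from the work discussed here:

```python
# Rough per-operation energy costs (assumed illustrative values, ~45 nm).
ENERGY_PJ = {
    "fp32_multiply": 3.7,     # performing the arithmetic
    "sram_read_32b": 5.0,     # fetching an operand from on-chip SRAM
    "dram_read_32b": 640.0,   # fetching an operand from off-chip DRAM
}

ratio = ENERGY_PJ["dram_read_32b"] / ENERGY_PJ["fp32_multiply"]
print(f"One off-chip DRAM fetch costs ~{ratio:.0f}x one multiply")
# -> moving a word off-chip can cost two orders of magnitude more energy
#    than computing with it; that gap is exactly what CIM attacks.
```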
Machine learning models have massive data-transfer needs relative to their modest computing requirements. In neural networks, both the inference and training stages typically reduce to a generalized matrix-vector product, y = αAx + βy: a large weight matrix A is multiplied by an input vector x (scaled by α), and a bias term βy is added to the result.
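In code, that kernel is only a few lines. A minimal NumPy sketch (the 4096-element sizes are placeholders, not taken from any cited model):

```python
import numpy as np

# One layer's core kernel: the generalized matrix-vector product
# y = alpha * A @ x + beta * y (the BLAS GEMV operation).
rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 4096)).astype(np.float32)  # weight matrix
x = rng.standard_normal(4096).astype(np.float32)          # input vector
y = rng.standard_normal(4096).astype(np.float32)          # bias / accumulator
alpha, beta = 1.0, 1.0

y = alpha * (A @ x) + beta * y

# Each element of A is fetched from memory but used in just one
# multiply-add, so the kernel's runtime is dominated by data movement.
```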
Some models use millions or even billions of parameters. With such large matrices, reading and writing the data to be operated on may take much longer than the calculation itself. ChatGPT, the large language model, is an example: the memory-bound portion of the workload accounts for as much as 80% of total execution time.[3]
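A rough arithmetic-intensity count shows why such workloads are memory-bound. Continuing the GEMV example above (the byte count assumes each operand crosses the memory bus exactly once, which is optimistic):

```python
# FLOPs vs. bytes moved for y = A @ x with an n x n float32 matrix.
n = 4096
flops = 2 * n * n                  # one multiply and one add per element of A
bytes_moved = 4 * (n * n + 2 * n)  # A read once; x and y read/written once

intensity = flops / bytes_moved
print(f"~{intensity:.2f} FLOPs per byte")
# -> ~0.50 FLOPs/byte, while a modern accelerator needs tens of FLOPs per
#    byte to keep its arithmetic units busy: a classic memory-bound kernel.
```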
CIM Architecture Approaches
Hybrid Computing Architectures
Designing efficient CIM architectures is non-trivial, though. In work presented at this year's VLSI Symposium, researcher Yuhao Ju and colleagues at Northwestern University considered AI-related tasks for robotics applications.[5] Here, general-purpose computing accounts for more than 75% of the total workload, including such tasks as trajectory tracking and camera localization.
One possible solution, seen in designs like Samsung's LPDDR-PIM accelerator module, relies on a simple, but general-purpose calculation module, optimized for matrix multiplication or some other arithmetic operation. Software tools designed to manage memory-coupled computing assume the job of effectively partitioning the workload.
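The partitioning decision such tools make might resemble the heuristic below. This is a hypothetical sketch; the function name, threshold, and routing targets are invented for illustration and are not drawn from Samsung's actual software stack:

```python
def place_kernel(flops: int, bytes_moved: int, threshold: float = 10.0) -> str:
    """Route memory-bound kernels to the in-memory unit, the rest to the host."""
    intensity = flops / bytes_moved  # FLOPs per byte of traffic
    return "host_processor" if intensity >= threshold else "pim_module"

n = 4096
# Matrix-vector product: ~0.5 FLOPs/byte -> runs in memory.
print(place_kernel(flops=2 * n * n, bytes_moved=4 * (n * n + 2 * n)))
# Large matrix-matrix product: ~680 FLOPs/byte -> runs on the host.
print(place_kernel(flops=2 * n**3, bytes_moved=12 * n * n))
```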
Emerging Memory Technologies
Reis and colleagues designed a configurable memory array based on FeFETs to accelerate a recommendation system. Each array can operate in RAM mode to read and write lookup tables, perform Boolean logic and arithmetic operations in GPCiM (general purpose compute-in-memory) mode, or operate in content-addressable memory (CAM) mode to search the entire array in parallel.
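A toy software model of those three modes helps make the distinction concrete. This is purely a functional illustration; the real array performs these operations in parallel analog circuitry inside the FeFET cells:

```python
import numpy as np

# A tiny stored array; rows are entries, columns are bits.
array = np.array([[1, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 0, 1, 0]], dtype=np.uint8)

# RAM mode: address a single row and read it out.
row = array[1]

# GPCiM mode: Boolean logic between stored rows without moving them.
and_rows = array[0] & array[2]

# CAM mode: present a key; all rows are compared at once and the
# indices of matching rows come back in a single operation.
key = np.array([1, 0, 1, 0], dtype=np.uint8)
matches = np.where((array == key).all(axis=1))[0]
print(row, and_rows, matches)  # -> [0 1 1 0] [1 0 1 0] [2]
```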
Part of the appeal of 3D integration is the potential to improve performance by increasing bandwidth and reducing the data path length. Yiwei Du and colleagues at Tsinghua University built an HfO2/TaOx ReRAM array on top of conventional CMOS logic, then added a third layer with InGaZnOx FeFET transistors.
Industry Implications
Memory vendors like Samsung and Hynix have been showing compute-in-memory concepts at conferences like Hot Chips for several years. As Dean pointed out, though, traditional data center metrics have devalued energy efficiency in favor of absolute performance. Such performance-first metrics are no longer sufficient in an increasingly power-constrained environment.
Frequently Asked Questions

What is compute-in-memory (CIM)?
Compute-in-memory is an architecture that integrates compute units into memory arrays, processing data where it is stored to reduce data-transfer energy and latency.

How does CIM address AI's energy problem?
CIM lowers energy consumption by reducing data movement between the processor and memory. With some AI workloads spending as much as 80% of execution time memory-bound, CIM can significantly relieve this bottleneck.

What technical challenges does CIM currently face?
Challenges include architectural design complexity, algorithm-hardware co-optimization, reliability of emerging memory technologies, and the maturity of the software toolchain.

Which applications are best suited to CIM architectures?
Memory-intensive AI workloads with relatively modest compute, such as recommendation systems, neural-network inference, and graph computing, are the best fit for CIM.

How far along is CIM commercialization?
Vendors such as Samsung and SK Hynix have demonstrated prototypes, and academia and industry are accelerating development; commercial deployment in specific AI acceleration scenarios is expected within the next three to five years.