Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

Summary

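Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and hold meaningful conversations.
As the computational and memory requirements of these systems grow exponentially, LLM inference must become faster, more efficient, and more accessible.
Meanwhile, advances in compute and memory capability are lagging behind, a gap made worse by the end of Moore's law.
Because LLMs exceed the capacity of a single GPU, they require complex, expert-level configurations for parallel processing.
Memory accesses become significantly more expensive than computation, which poses a challenge for efficient scaling known as the memory wall.
Compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by performing analog computations directly in memory, potentially reducing latency and power consumption.
By tightly integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency.
This survey paper provides an overview and analysis of transformer-based models, reviews various CIM architectures, and explores how they can address the pressing challenges of modern AI computing systems.
It discusses transformer-related operators and their hardware acceleration schemes, and highlights challenges, trends, and insights in the corresponding CIM designs.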

Summary (original)

Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore’s law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.

arXiv information

Authors	Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura
Published	2024-06-12 16:57:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
