L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

Summary

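Large language models (LLMs) increasingly need to process long text sequences, but limited GPU memory forces a difficult trade-off between memory capacity and bandwidth.
HBM-based acceleration offers high bandwidth, but its capacity remains constrained.
Offloading data to host-side DIMMs improves capacity but introduces costly data-swapping overhead.
We identify that the critical memory bottleneck lies exclusively in the decoding phase of multi-head attention (MHA), which demands substantial capacity for storing KV caches and high bandwidth for attention computation.
Our key insight is that this operation aligns uniquely with modern DIMM-based processing-in-memory (PIM) architectures, which offer scalability in both capacity and bandwidth.
Based on this observation and insight, we propose L3, a hardware-software co-designed system that integrates DIMM-PIM and GPU devices.
L3 introduces three innovations. First, hardware redesigns resolve the data layout mismatches and computational element mismatches in DIMM-PIM, improving utilization for LLM inference.
Second, communication optimization hides data transfer overhead behind computation.
Third, an adaptive scheduler coordinates GPU and DIMM-PIM operations to maximize parallelism between devices.
Evaluations using real-world traces show that L3 achieves up to a 6.1$\times$ speedup over state-of-the-art HBM-PIM solutions while significantly improving batch sizes.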

Summary (original)

Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data swapping overhead. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention (MHA) exclusively, which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight reveals this operation uniquely aligns with modern DIMM-based processing-in-memory (PIM) architectures, which offers scalability of both capacity and bandwidth. Based on this observation and insight, we propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices. L3 introduces three innovations: First, hardware redesigns resolve data layout mismatches and computational element mismatches in DIMM-PIM, enhancing LLM inference utilization. Second, communication optimization enables hiding the data transfer overhead with the computation. Third, an adaptive scheduler coordinates GPU-DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show L3 achieves up to 6.1$\times$ speedup over state-of-the-art HBM-PIM solutions while significantly improving batch sizes.
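
To make the capacity and bandwidth pressure of MHA decoding concrete, the following is a minimal back-of-the-envelope sketch and is not taken from the paper. It uses the standard KV cache size estimate (keys plus values, for every layer, head, and token) and the observation that one decode step has to read the cached keys and values at least once. The function names and the model dimensions (32 layers, 32 heads of dimension 128, FP16, 32K-token context, batch of 8) are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    # The leading factor of 2 covers the key tensor and the value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def min_decode_step_seconds(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step if attention must stream the whole cache."""
    return cache_bytes / bandwidth_bytes_per_s


# Hypothetical model and workload, assumed for illustration only.
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=32_768, batch_size=8)
print(f"KV cache: {total / 2**30:.1f} GiB")  # prints "KV cache: 128.0 GiB"
```

Under these assumed numbers the cache alone exceeds the memory of a single typical GPU, and because each generated token re-reads it, the decode step time is bounded below by cache size divided by memory bandwidth. This is the combination of capacity and bandwidth demand that the abstract attributes to the decoding phase.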

arXiv information

Authors	Qingyuan Liu, Liyan Chen, Yanning Yang, Haocheng Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen
Published	2025-04-24 14:14:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
