Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping

Summary

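Inference with large language models is both memory-intensive and slow, so scaling it efficiently often requires distributed algorithms.
Multi-GPU training and inference use various model parallelism strategies that partition computation across devices to reduce memory load and computation time.
Model parallelism, however, requires GPUs to exchange information, and this communication is a major bottleneck that limits the benefit of adding more devices.
The authors introduce Ladder Residual, a simple architectural change applicable to all residual-based models, which allows communication to be overlapped with computation in a straightforward way and so hides communication latency.
Their insight is that, in addition to systems-level optimization, the model architecture itself can be redesigned to decouple communication from computation.
Ladder Residual can decouple communication from computation in conventional parallelism patterns, but the paper focuses on tensor parallelism (TP), which is particularly bottlenecked by its heavy communication.
For a Transformer model with 70B parameters, applying Ladder Residual to every layer yields a 30% end-to-end wall-clock speedup at inference time with TP sharding across 8 devices.
The resulting Transformer model is called the Ladder Transformer.
The authors train 1B and 3B Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline.
They also show that parts of the Llama-3.1 8B model can be converted to the Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.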
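
The abstract does not spell out how the residual connections are rerouted. The sketch below is one reading of the idea, under the assumption that each block reads the residual stream before the previous block's output has been added, so that the previous output (which would need an all-reduce under TP) can still be in flight while the next block computes. The function names, the `pending` variable, and the toy blocks are hypothetical and used only for illustration; no real communication takes place.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def standard_residual(x: Vector, blocks: List[Block]) -> Vector:
    """Each block reads the stream that already includes the previous output.

    Under tensor parallelism this means block i cannot start until the
    all-reduce of block i-1 has finished.
    """
    for block in blocks:
        x = add(x, block(x))
    return x


def ladder_residual(x: Vector, blocks: List[Block]) -> Vector:
    """Assumed Ladder-style routing (not taken from the abstract).

    `pending` stands in for the previous block's output while its
    all-reduce would still be in flight. The current block computes on a
    stream that does not contain `pending`, so the two steps do not depend
    on each other and could overlap.
    """
    pending = [0.0] * len(x)
    for block in blocks:
        out = block(x)        # compute: independent of `pending`
        x = add(x, pending)   # previous output "arrives" and is added
        pending = out         # this block's output is now in flight
    return add(x, pending)    # the last output lands after the loop


if __name__ == "__main__":
    double = lambda v: [2.0 * t for t in v]  # toy stand-in for a layer
    print(standard_residual([1.0], [double, double]))
    print(ladder_residual([1.0], [double, double]))
```

The two functions agree for a single block but generally differ for deeper stacks, which fits the abstract's description of Ladder Residual as a change to the architecture that requires training or retraining rather than a pure systems optimization.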

Abstract (original)

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve 30% end-to-end wall clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens.

arXiv information

Authors	Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
Published	2025-01-21 14:33:38+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
