Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping

Summary

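Large language model inference is both memory-intensive and time-consuming, and it often requires distributed algorithms to scale efficiently.
Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time.
However, model parallelism requires communicating information between GPUs, which is a major bottleneck and limits the gains obtained by scaling up the number of devices.
The authors introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping, which effectively hides communication latency.
Their insight is that, in addition to systems optimization, the model architecture itself can be redesigned to decouple communication from computation.
Ladder Residual can enable communication-computation decoupling in conventional parallelism patterns, but the paper focuses on Tensor Parallelism (TP), which is particularly bottlenecked by its heavy communication.
For a Transformer model with 70B parameters, applying Ladder Residual to all layers achieves a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices.
The resulting Transformer model is called the Ladder Transformer.
The authors train 1B and 3B Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline.
They also show that parts of the Llama-3.1 8B model can be converted to the Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
The code for training and inference is released to make the experiments easier to replicate.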
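
The summary does not spell out how the residual connection is rewired, so the following is only a minimal sketch under a stated assumption: each block reads the residual stream as it was before the previous block's output was added, which would let that output's all-reduce run while the current block computes. The function names, the integer toy blocks, and the idealized timing model are all hypothetical and are not taken from the released code.

```python
from typing import Callable, List, Tuple

# Toy "block": maps the residual stream (an int here) to an update.
Block = Callable[[int], int]


def standard_residual(x: int, blocks: List[Block]) -> Tuple[int, List[int]]:
    """Standard wiring: each block reads the fully updated residual stream."""
    seen = []  # inputs each block observed, for inspection
    for block in blocks:
        seen.append(x)
        x = x + block(x)  # the next block must wait for this update
    return x, seen


def ladder_residual(x: int, blocks: List[Block]) -> Tuple[int, List[int]]:
    """Assumed ladder wiring: each block reads the stream WITHOUT the
    previous block's output, which is still "in flight"."""
    seen = []
    pending = 0  # previous block's output, not yet added to the stream
    for block in blocks:
        seen.append(x)
        out = block(x)    # does not depend on `pending`
        x = x + pending   # the previous output lands only now
        pending = out
    return x + pending, seen


def sequential_time(compute: float, comm: float, n_blocks: int) -> float:
    """Idealized cost when every block's communication is fully exposed."""
    return n_blocks * (compute + comm)


def overlapped_time(compute: float, comm: float, n_blocks: int) -> float:
    """Idealized cost when block i's communication overlaps block i+1's
    compute; only the last communication is exposed (n_blocks >= 1)."""
    return compute + (n_blocks - 1) * max(compute, comm) + comm
```

Under this assumption the two wirings compute different functions, which is consistent with the summary's point that models are trained from scratch or retrained rather than converted for free. The timing functions use arbitrary units and ignore everything except per-block compute and communication cost.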

Abstract (original)

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-gpu training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve 29% end-to-end wall clock speed up at inference time with TP sharding over 8 devices. We refer the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens. We release our code for training and inference for easier replication of experiments.

arXiv information

Authors	Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
Published	2025-02-07 08:23:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
