Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

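Summary

Expert parallelism was introduced as a strategy for distributing the computational load of sparsely gated Mixture-of-Experts (MoE) models across multiple computing devices, making it easier to run increasingly large models. However, the All-to-All communication inherent in expert parallelism incurs significant overhead and reduces the efficiency of MoE models. Current optimization approaches offer some relief, but they are constrained by the sequential interdependence of communication and computation operations. To address this limitation, we present a new shortcut-connected MoE (ScMoE) architecture with an overlapping parallel strategy. This architecture effectively decouples communication from its conventional order, allowing 70% to 100% of it to overlap with computation. Compared with the common top-2 MoE architecture, ScMoE improves training speed by 30% and 11%, and inference speed by 40% and 15%, in distributed environments using PCIe and NVLink hardware, respectively. Building on the ScMoE architecture, we further implement an expert offloading strategy that facilitates memory-limited inference and optimizes latency by overlapping expert migration. In addition, extensive experiments and theoretical analysis show that ScMoE achieves model quality that is comparable to, and in some cases better than, that of existing approaches.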

Summary (original)

Expert parallelism has been introduced as a strategy to distribute the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple computing devices, facilitating the execution of these increasingly large-scale models. However, the All-to-All communication intrinsic to expert parallelism constitutes a significant overhead, diminishing the MoE models’ efficiency. Current optimization approaches offer some relief, yet they are constrained by the sequential interdependence of communication and computation operations. To address this limitation, we present a novel shortcut-connected MoE (ScMoE) architecture with an overlapping parallel strategy, which effectively decouples communication from its conventional sequence, allowing for a substantial overlap of 70% to 100% with computation. When compared with the prevalent top-2 MoE architecture, ScMoE demonstrates training speed improvements of 30% and 11%, and inference improvements of 40% and 15%, in our distributed environments with PCIe and NVLink hardware, respectively, where communication constitutes 60% and 15% of the total MoE time consumption. Building on the ScMoE architecture, we further implement an expert offloading strategy to facilitate memory-limited inference, optimizing latency through the overlap of expert migration. Additionally, extensive experiments and theoretical analyses indicate that ScMoE not only achieves comparable but in some instances surpasses the model quality of existing approaches.
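
The relationship between the communication share and the overlap ratio quoted in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is an idealized model, not the paper's measurement method: it normalizes the baseline MoE time to 1.0, assumes that overlapped communication costs no extra time (that is, enough concurrent computation exists to hide it), and ignores any other change ScMoE makes to the computation. The function names `moe_time_after_overlap` and `speedup` are hypothetical.

```python
def moe_time_after_overlap(comm_share: float, overlap: float) -> float:
    """Relative MoE time once a fraction `overlap` of communication is hidden.

    The baseline time is normalized to 1.0 and split into communication
    (`comm_share`) and computation (1 - comm_share). Assumption: hidden
    communication is free, which is an idealization.
    """
    if not (0.0 <= comm_share <= 1.0 and 0.0 <= overlap <= 1.0):
        raise ValueError("comm_share and overlap must lie in [0, 1]")
    exposed_comm = comm_share * (1.0 - overlap)  # communication still on the critical path
    return (1.0 - comm_share) + exposed_comm


def speedup(comm_share: float, overlap: float) -> float:
    """Idealized speedup of the MoE portion relative to the non-overlapped baseline."""
    return 1.0 / moe_time_after_overlap(comm_share, overlap)


if __name__ == "__main__":
    # Communication shares (60% on PCIe, 15% on NVLink) and the 70%-100%
    # overlap range are the figures quoted in the abstract.
    for name, share in (("PCIe", 0.60), ("NVLink", 0.15)):
        for overlap in (0.7, 1.0):
            t = moe_time_after_overlap(share, overlap)
            print(f"{name}: overlap={overlap:.0%} -> relative time {t:.3f}, "
                  f"speedup x{speedup(share, overlap):.2f}")
```

These values are ceilings for the MoE portion alone under the stated assumptions, so they are not expected to reproduce the end-to-end training and inference improvements reported in the abstract.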

arXiv information

Authors	Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang
Published	2024-11-01 08:55:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
