MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

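Summary

This paper introduces MegaScale-MoE, a production system built for efficient training of large-scale mixture-of-experts (MoE) models.
MoE has emerged as a promising architecture for scaling large language models (LLMs) to unprecedented sizes, which in turn improves model performance.
However, existing MoE training systems suffer from declining training efficiency, a problem made worse by the growing scale of MoE models and the continuous evolution of hardware.
Recognizing that efficient communication is pivotal to MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for the attention and FFN modules in each MoE layer, and takes a holistic approach to overlapping communication with computation at both the inter-operator and intra-operator levels.
In addition, MegaScale-MoE applies communication compression, with adjusted communication patterns, to lower the precision of communicated data and further improve training efficiency.
When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, a 1.88$\times$ efficiency improvement over Megatron-LM.
The authors share their operational experience in accelerating MoE training and hope that their system design insights will motivate future research on MoE systems.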

Abstract (original)

We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
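As a rough sanity check on the figures quoted in the abstract, the short sketch below derives two numbers the abstract does not state directly: the average throughput per GPU and the Megatron-LM throughput implied by the 1.88$\times$ figure. It assumes that the 1.88$\times$ improvement is a throughput ratio measured on the same 1,440 GPUs, which the abstract does not say explicitly; the function names are illustrative only.

```python
# Back-of-the-envelope figures derived from numbers reported in the abstract.
TOKENS_PER_SECOND = 1.41e6  # reported MegaScale-MoE training throughput
NUM_GPUS = 1440             # reported number of NVIDIA Hopper GPUs
SPEEDUP = 1.88              # reported improvement over Megatron-LM


def per_gpu_throughput(tokens_per_second: float, num_gpus: int) -> float:
    """Average tokens processed per second by a single GPU."""
    return tokens_per_second / num_gpus


def implied_baseline(tokens_per_second: float, speedup: float) -> float:
    """Baseline throughput implied by the speedup.

    Assumption: the speedup is a throughput ratio on identical hardware.
    """
    return tokens_per_second / speedup


per_gpu = per_gpu_throughput(TOKENS_PER_SECOND, NUM_GPUS)
baseline = implied_baseline(TOKENS_PER_SECOND, SPEEDUP)

print(f"Per-GPU throughput: {per_gpu:.1f} tokens/s")
print(f"Implied Megatron-LM throughput: {baseline / 1e6:.2f}M tokens/s")
```

Under that assumption, the reported numbers correspond to roughly 979 tokens/s per GPU and a baseline of about 0.75M tokens/s.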

arXiv information

Authors	Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu
Published	2025-05-19 06:12:54+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
