Accelerating Mixture-of-Experts Training with Adaptive Expert Replication

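Summary

Mixture-of-Experts (MoE) models have become a widely adopted way to keep scaling model size without a corresponding linear increase in compute.
During MoE model training, each input token is dynamically routed to a subset of experts (sparsely activated feed-forward networks) within each transformer layer.
The distribution of tokens assigned to each expert varies widely and rapidly over the course of training.
To handle the wide load imbalance across experts, current systems are forced either to drop tokens assigned to popular experts, which degrades convergence, or to frequently rebalance the resources allocated to each expert based on popularity, which incurs high state migration overheads.
To break this performance-accuracy tradeoff, the authors introduce SwiftMoE, an adaptive MoE training system.
The key insight of SwiftMoE is to decouple the placement of expert parameters from their large optimizer state.
SwiftMoE statically partitions the optimizer state of each expert across all training nodes.
Meanwhile, SwiftMoE dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads.
In doing so, SwiftMoE right-sizes the GPU resources allocated to each expert on a per-iteration basis, with minimal overhead.
Compared to the state-of-the-art MoE training systems DeepSpeed and FlexMoE, SwiftMoE achieves 30.5% and 25.9% faster time-to-convergence, respectively.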
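
The load imbalance and token dropping described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not SwiftMoE's or any other system's implementation: the router scores are random draws with a hypothetical per-expert bias, and `route_top_k`, `count_dropped`, and the uniform-share `capacity` are names and choices made up for illustration.

```python
import random
from collections import Counter


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def count_dropped(assignments, num_experts, capacity):
    """Return per-expert load and the number of assignments over capacity."""
    load = Counter(e for experts in assignments for e in experts)
    dropped = sum(max(0, load[e] - capacity) for e in range(num_experts))
    return load, dropped


random.seed(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 8, 2, 1024

# Assumption: a fixed bias per expert stands in for a learned router
# that currently favors some experts over others.
bias = [2.0, 1.0, 0.5, 0.0, 0.0, -0.5, -1.0, -2.0]

assignments = [
    route_top_k([random.gauss(b, 1.0) for b in bias], TOP_K)
    for _ in range(NUM_TOKENS)
]

# Assumption: each expert may process at most a uniform share of the work.
capacity = NUM_TOKENS * TOP_K // NUM_EXPERTS
load, dropped = count_dropped(assignments, NUM_EXPERTS, capacity)

print("per-expert load:", [load[e] for e in range(NUM_EXPERTS)])
print("capacity per expert:", capacity)
print("assignments dropped:", dropped)
```

With a skewed router, the popular experts exceed the uniform capacity and their excess assignments are dropped, while the unpopular experts leave capacity unused.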

Summary (original)

Mixture-of-Experts (MoE) models have become a widely adopted solution to continue scaling model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts — sparsely-activated feed-forward networks — within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle the wide load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance resources allocated to each expert based on popularity, incurring high state migration overheads. To break this performance-accuracy tradeoff, we introduce SwiftMoE, an adaptive MoE training system. The key insight of SwiftMoE is to decouple the placement of expert parameters from their large optimizer state. SwiftMoE statically partitions the optimizer of each expert across all training nodes. Meanwhile, SwiftMoE dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SwiftMoE right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overheads. Compared to state-of-the-art MoE training systems, DeepSpeed and FlexMoE, SwiftMoE is able to achieve a 30.5% and 25.9% faster time-to-convergence, respectively.
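
To make the idea of right-sizing resources per expert concrete, the sketch below assigns replica slots to experts in proportion to their token counts. The abstract does not describe SwiftMoE's actual placement policy, so this is an assumed stand-in: `allocate_replicas` is a hypothetical helper that uses a largest-remainder split and guarantees every expert at least one replica.

```python
def allocate_replicas(token_counts, total_slots):
    """Give each expert one replica, then split spare slots by load.

    Assumption: a largest-remainder proportional split, chosen for
    illustration only. If no tokens were routed, spare slots stay unused.
    """
    n = len(token_counts)
    if total_slots < n:
        raise ValueError("need at least one slot per expert")

    replicas = [1] * n
    spare = total_slots - n
    total = sum(token_counts)
    if total == 0 or spare == 0:
        return replicas

    # Ideal fractional share of the spare slots for each expert.
    shares = [spare * c / total for c in token_counts]
    for e in range(n):
        replicas[e] += int(shares[e])

    # Hand out the slots lost to rounding, largest fractional part first.
    leftover = total_slots - sum(replicas)
    by_remainder = sorted(
        range(n), key=lambda e: shares[e] - int(shares[e]), reverse=True
    )
    for e in by_remainder[:leftover]:
        replicas[e] += 1
    return replicas


# Hypothetical token counts for four experts in one iteration.
counts = [700, 150, 100, 50]
replicas = allocate_replicas(counts, total_slots=8)
print("replicas per expert:", replicas)
print("tokens per replica:", [c / r for c, r in zip(counts, replicas)])
```

Recomputing such an allocation every iteration is only practical if changing an expert's replica count is cheap, which is the cost the abstract says SwiftMoE avoids by keeping optimizer state statically partitioned.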

arXiv information

Authors	Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, Christos Kozyrakis
Published	2025-04-28 15:58:55+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
