FlexDeMo: Decoupled Momentum Optimization for Fully and Hybrid Sharded Training

Summary

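Training large neural network models often requires extensive computational resources distributed across several nodes and accelerators.
Recent findings suggest that it may be sufficient to exchange only the fast-moving components of the gradients while accumulating momentum locally (Decoupled Momentum, or DeMo).
However, for larger models that do not fit on a single accelerator, the exchange of gradient information and the integration of DeMo need to be reconsidered.
Here, we propose a hybrid strategy, FlexDeMo, in which nodes fully synchronize locally across their GPUs, while inter-node communication is improved by exchanging only the fast-moving components.
This effectively combines earlier hybrid sharding strategies with the advantages of decoupled momentum.
Experimental results show that FlexDeMo is on par with AdamW in terms of validation loss, demonstrating its viability.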
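
The hybrid scheme described above can be illustrated with a toy, single-process sketch. It is not the authors' implementation. The function names, the learning rate, the momentum coefficient `beta`, and the use of plain top-k selection by magnitude as a stand-in for "fast-moving components" are all assumptions made for illustration, and parameter sharding across GPUs is not modeled at all. The sketch only shows the communication pattern: gradients are fully averaged within a node, momentum is accumulated locally per node, and only the extracted components are averaged between nodes.

```python
def average(vectors):
    """Element-wise mean of equally sized lists."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def top_k_components(vec, k):
    """Keep the k largest-magnitude entries and zero the rest.

    Assumption: this is only a stand-in for extracting 'fast-moving' components.
    """
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def flexdemo_like_step(params, grads_per_node, momenta, lr=0.1, beta=0.9, k=1):
    """One hypothetical optimization step.

    grads_per_node[n][g] is the gradient computed on GPU g of node n.
    momenta[n] is the local momentum of node n; it is updated in place.
    """
    shared = []
    for n, gpu_grads in enumerate(grads_per_node):
        # Intra-node: full synchronization across the node's GPUs.
        node_grad = average(gpu_grads)
        # Momentum is accumulated locally and never sent in full.
        m = [beta * mi + gi for mi, gi in zip(momenta[n], node_grad)]
        # Extract the components that will be communicated.
        q = top_k_components(m, k)
        # The residual stays in the local momentum.
        momenta[n] = [mi - qi for mi, qi in zip(m, q)]
        shared.append(q)
    # Inter-node: only the extracted (sparse) components are averaged.
    update = average(shared)
    return [p - lr * u for p, u in zip(params, update)]


if __name__ == "__main__":
    params = [1.0, 1.0, 1.0]
    grads = [
        [[1.0, 0.0, 0.0], [3.0, 0.0, 0.2]],  # node 0, two GPUs
        [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]],  # node 1, two GPUs
    ]
    momenta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(flexdemo_like_step(params, grads, momenta))
    print(momenta)
```

In this toy setup, each node sends `k` values per step across the node boundary instead of the full gradient, and whatever is not sent stays in that node's momentum for later steps.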

Summary (original)

Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, when considering larger models that do not fit on a single accelerator, the exchange of gradient information and the integration of DeMo needs to be reconsidered. Here, we propose employing a hybrid strategy, FlexDeMo, whereby nodes fully synchronize locally between different GPUs and inter-node communication is improved through only using the fast-moving components. This effectively combines previous hybrid sharding strategies with the advantages of decoupled momentum. Our experimental results show that FlexDeMo is on par with AdamW in terms of validation loss, demonstrating its viability.

arXiv information

Authors	Mogens Henrik From, Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp
Published	2025-02-10 17:55:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
