DeMo: Decoupled Momentum Optimization

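Summary

Training large neural networks typically requires sharing gradients between accelerators over specialized high-speed interconnects.
Building on the signal processing principles of frequency decomposition and energy compaction, we show that full optimizer states and model parameters do not need to be synchronized during training.
By decoupling momentum updates and allowing controlled divergence of optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers.
We introduce Decoupled Momentum (DeMo), a fused optimizer and data-parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.
This makes it possible to train large neural networks even with limited network bandwidth and heterogeneous hardware.
Our method is topology-agnostic and architecture-independent, and it supports scalable clock-synchronous distributed training with negligible compute and memory overhead.
Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large-scale foundation models.
An open-source reference PyTorch implementation is published on GitHub (https://github.com/bloc97/DeMo).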

Summary (original)

Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent and supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large scale foundation models. An open source reference PyTorch implementation is published on GitHub at https://github.com/bloc97/DeMo
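
The abstract names frequency decomposition and energy compaction as the principles behind the reduced communication. The following is a minimal sketch of that general principle only, not of DeMo itself: it applies an orthonormal DCT to a smooth toy signal, keeps the largest-magnitude coefficients, and measures how much of the signal survives. The function names (`dct_matrix`, `compress_top_k`), the toy signal, and the choice of 16 coefficients are all assumptions made for illustration; the actual algorithm is in the reference implementation linked above.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis as an (n, n) matrix (rows = basis vectors)."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    basis = np.cos(np.pi * (i + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0, :] /= np.sqrt(2.0)  # rescale the DC row so that basis @ basis.T == I
    return basis


def compress_top_k(signal: np.ndarray, k: int):
    """Keep only the k largest-magnitude DCT coefficients of a 1-D signal.

    Returns (indices, values, approximation, residual). Only `indices` and
    `values` would need to be transmitted; the residual stays local.
    """
    basis = dct_matrix(signal.size)
    coeffs = basis @ signal                    # frequency decomposition
    keep = np.argsort(np.abs(coeffs))[-k:]     # indices of the k largest coefficients
    sparse = np.zeros_like(coeffs)
    sparse[keep] = coeffs[keep]
    approx = basis.T @ sparse                  # inverse transform of the kept part
    residual = signal - approx                 # what the kept coefficients miss
    return keep, coeffs[keep], approx, residual


# Toy example (hypothetical data): a smooth signal sampled at 256 points.
n = 256
t = (np.arange(n) + 0.5) / n
signal = np.sin(6 * np.pi * t) + 0.5 * np.cos(14 * np.pi * t)

idx, vals, approx, residual = compress_top_k(signal, k=16)
rel_err = np.linalg.norm(residual) / np.linalg.norm(signal)
print(f"kept {idx.size}/{n} coefficients, relative L2 error = {rel_err:.4f}")
```

For smooth signals like this one, a small fraction of the coefficients carries most of the energy, which is why sending only those coefficients can be much cheaper than sending every value.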

arXiv information

Authors	Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma
Published	2024-11-29 17:31:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
