Ordered Momentum for Asynchronous SGD

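Abstract

Distributed learning is essential for training large-scale deep models.
Asynchronous SGD (ASGD) and its variants are widely used distributed learning methods, especially when the workers in a cluster have heterogeneous computing capabilities.
Momentum is recognized for its benefits to both optimization and generalization in deep model training.
However, existing work has found that naively incorporating momentum into ASGD can impede convergence.
This paper proposes a new method for ASGD called ordered momentum (OrMo).
In OrMo, momentum is incorporated into ASGD by organizing the gradients in order according to their iteration indexes.
The convergence of OrMo is proved theoretically for non-convex problems with both constant and delay-adaptive learning rates.
To the best of the authors' knowledge, this is the first work to establish a convergence analysis of ASGD with momentum that does not depend on the maximum delay.
Empirical results show that OrMo can achieve better convergence performance than ASGD and other asynchronous methods with momentum.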

Abstract (original)

Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (OrMo) for ASGD. In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis of ASGD with momentum without dependence on the maximum delay. Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD and other asynchronous methods with momentum.
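
The abstract's remark about stale gradients and naive momentum can be made concrete with a toy simulation. The sketch below is not the OrMo algorithm and not the authors' experimental setup: it assumes a one-dimensional quadratic objective, a single fixed delay, and heavy-ball momentum applied directly to delayed gradients. The names `run_momentum_sgd` and `tail_amplitude` are hypothetical helpers introduced only for illustration. It does not isolate the role of momentum, since it only compares fresh and stale gradients under the same momentum update.

```python
def run_momentum_sgd(delay, steps=200, lr=0.1, beta=0.9, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 with heavy-ball momentum and stale gradients.

    Assumption for illustration: the gradient applied at step t is evaluated
    at the iterate from `delay` steps earlier, mimicking a slow worker in
    asynchronous SGD. Returns the list of iterates w_0, ..., w_steps.
    """
    history = [w0]
    u = 0.0  # momentum buffer
    for t in range(steps):
        # Iterate the (simulated) worker saw when it started computing.
        stale_w = history[max(0, t - delay)]
        grad = stale_w  # gradient of 0.5 * w**2 at stale_w
        u = beta * u + grad  # naive momentum on the delayed gradient
        history.append(history[-1] - lr * u)
    return history


def tail_amplitude(trajectory, window=50):
    """Largest |w| over the last `window` iterates."""
    return max(abs(w) for w in trajectory[-window:])


fresh = run_momentum_sgd(delay=0)   # synchronous baseline
stale = run_momentum_sgd(delay=20)  # every gradient is 20 steps old
print("fresh tail amplitude:", tail_amplitude(fresh))
print("stale tail amplitude:", tail_amplitude(stale))
```

In this toy setting the delayed run first overshoots the minimum by a wide margin, because the momentum buffer keeps accumulating gradients evaluated at the initial point, and its late iterates stay much farther from zero than those of the undelayed run.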

arXiv information

Authors	Chang-Wei Shi, Yi-Rui Yang, Wu-Jun Li
Published	2025-01-23 17:04:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
