A Quadratic Synchronization Rule for Distributed Deep Learning

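Summary

In distributed deep learning with data parallelism, synchronizing gradients at every training step can incur a large communication overhead, especially when many nodes cooperate to train a large model.
Local gradient methods such as Local SGD address this by letting each worker compute locally for $H$ steps without synchronizing with the others, which reduces communication frequency.
$H$ has traditionally been viewed as a hyperparameter that trades optimization efficiency for communication cost, but recent research indicates that a well-chosen $H$ can also improve generalization.
Choosing a suitable $H$ is nevertheless difficult.
This work proposes a theory-grounded method for determining $H$, called the Quadratic Synchronization Rule (QSR), which recommends setting $H$ dynamically in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time.
Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently achieve higher test accuracy than other synchronization strategies.
Compared with standard data-parallel training, QSR lets Local AdamW on ViT-B cut training time from 26.7 to 20.2 hours on 16 GPUs, or from 8.6 to 5.5 hours on 64 GPUs, while achieving $1.16\%$ or $0.84\%$ higher top-1 validation accuracy, respectively.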
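
To make the communication pattern concrete, here is a minimal single-process sketch of Local SGD on a toy problem. It is not the authors' implementation: the function name `local_sgd`, the quadratic per-worker objectives, and the use of exact (noise-free) gradients are all assumptions made for illustration. Each simulated worker takes $H$ local steps, and the models are averaged once per round.

```python
import numpy as np

def local_sgd(targets, lr, H, rounds):
    """Toy Local SGD. Hypothetical setup: worker k minimizes
    0.5 * ||w - targets[k]||^2 using exact gradients."""
    K, d = targets.shape
    w_global = np.zeros(d)
    for _ in range(rounds):
        # Every worker starts the round from the shared model.
        local = np.tile(w_global, (K, 1))
        for _ in range(H):
            # H local steps with no communication between workers.
            grads = local - targets
            local -= lr * grads
        # One synchronization per round: average the K local models.
        w_global = local.mean(axis=0)
    return w_global

targets = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
w = local_sgd(targets, lr=0.1, H=4, rounds=50)
```

In this sketch the workers communicate `rounds` times over `H * rounds` total steps, so a larger `H` means fewer synchronizations for the same number of gradient steps.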

Abstract (original)

In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with the standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves $1.16\%$ or $0.84\%$ higher top-1 validation accuracy.
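
The rule of setting $H$ in proportion to $\frac{1}{\eta^2}$ can be sketched in a few lines. The abstract states only the proportionality, so the rest is assumed for illustration and should not be read as the authors' exact recipe. That covers the proportionality constant `alpha`, the lower bound `h_base`, the floor operation, the cosine learning-rate schedule, and all function names.

```python
import math

def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr toward 0 (an assumed schedule)."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

def qsr_local_steps(lr, alpha, h_base):
    """H proportional to 1 / lr^2, floored, and never below h_base.
    alpha and h_base are hypothetical tuning constants."""
    if lr <= 0.0:
        raise ValueError("learning rate must be positive")
    return max(h_base, int((alpha / lr) ** 2))

def sync_schedule(total_steps, peak_lr, alpha, h_base):
    """List of (start_step, H) pairs, one entry per communication round."""
    schedule = []
    step = 0
    while step < total_steps:
        lr = cosine_lr(step, total_steps, peak_lr)
        # Clip the last round so the schedule ends exactly at total_steps.
        h = min(qsr_local_steps(lr, alpha, h_base), total_steps - step)
        schedule.append((step, h))
        step += h
    return schedule

schedule = sync_schedule(total_steps=1000, peak_lr=1.0, alpha=0.2, h_base=4)
```

Because $H$ scales with $1/\eta^2$, halving the learning rate quadruples the number of local steps between synchronizations, so communication becomes sparser as training proceeds and $\eta$ decays.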

arXiv information

Authors	Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
Published	2024-04-12 13:59:01+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
