Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

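Summary

The performance of mini-batch stochastic gradient descent (SGD) depends strongly on how the batch size and learning rate are set to minimize the empirical loss when training deep neural networks.
This paper presents a theoretical analysis of mini-batch SGD with four schedulers: (i) a constant batch size with a decaying learning rate, (ii) an increasing batch size with a decaying learning rate, (iii) an increasing batch size with an increasing learning rate, and (iv) an increasing batch size with a warm-up decaying learning rate.
It shows that mini-batch SGD with scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does with any of schedulers (ii), (iii), and (iv).
Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD.
The paper also provides numerical results that support the analysis, showing that scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than scheduler (i) or (ii).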

Abstract (original)

The performance of mini-batch stochastic gradient descent (SGD) strongly depends on setting the batch size and learning rate to minimize the empirical loss in training the deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results of supporting analyses showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).
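
The four schedulers named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the functional forms, so everything concrete here is an assumption made for illustration: the function names `batch_size` and `learning_rate`, the step-wise geometric growth, the cosine decay, the linear warm-up, and all default values are hypothetical and are not taken from the paper.

```python
import math


def batch_size(epoch, scheduler, b0=8, growth=2.0, every=20):
    """Batch size at `epoch` (hypothetical forms, not the paper's).

    Scheduler "i" keeps the batch size constant; "ii", "iii", and "iv"
    multiply it by `growth` once every `every` epochs.
    """
    if scheduler == "i":
        return b0
    return int(b0 * growth ** (epoch // every))


def learning_rate(epoch, scheduler, total_epochs=100, lr0=0.1,
                  lr_max=0.5, lr_growth=1.2, every=20, warmup=20):
    """Learning rate at `epoch` (hypothetical forms, not the paper's)."""
    if scheduler in ("i", "ii"):
        # Decaying learning rate: cosine decay from lr0 to 0 (assumed form).
        return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))
    if scheduler == "iii":
        # Increasing learning rate: grows step-wise alongside the batch size.
        return lr0 * lr_growth ** (epoch // every)
    if scheduler == "iv":
        # Warm-up decaying learning rate: linear warm-up from lr0 to lr_max,
        # then cosine decay from lr_max to 0 (assumed form).
        if epoch < warmup:
            return lr0 + (lr_max - lr0) * epoch / warmup
        progress = (epoch - warmup) / (total_epochs - warmup)
        return 0.5 * lr_max * (1 + math.cos(math.pi * progress))
    raise ValueError(f"unknown scheduler: {scheduler!r}")


if __name__ == "__main__":
    # Print the (batch size, learning rate) pair for each scheduler
    # at a few sample epochs.
    for name in ("i", "ii", "iii", "iv"):
        for epoch in (0, 20, 40, 80):
            b = batch_size(epoch, name)
            lr = learning_rate(epoch, name)
            print(f"scheduler ({name}) epoch {epoch:3d}: b={b:4d} lr={lr:.4f}")
```

In a training loop, each epoch would draw mini-batches of size `batch_size(epoch, name)` and take SGD steps with step size `learning_rate(epoch, name)`. Whether a given choice of constants reproduces the behavior reported in the paper depends on conditions that the abstract does not state.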

arXiv information

Authors	Hikaru Umeda, Hideaki Iiduka
Published	2024-09-13 12:24:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
