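We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence, saving at least 15% of steps on average.
Overshoot consistently outperforms both standard and Nesterov's momentum across a wide range of tasks, and it integrates into popular momentum-based optimizers with zero memory overhead and a small computational overhead.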
Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov’s momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at the current model weights before taking a step and updating the model. Rather than calculating the gradient at the current model weights, Overshoot calculates the gradient at model weights shifted in the direction of the current momentum. This sacrifices the immediate benefit of using the gradient with respect to the exact current model weights, in favor of evaluating it at a point that will likely be more relevant for future updates. We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence (saving on average at least 15% of steps). Overshoot consistently outperforms both standard and Nesterov’s momentum across a wide range of tasks and integrates into popular momentum-based optimizers with zero memory overhead and a small computational overhead.
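The core idea, evaluating the gradient at weights shifted along the current momentum instead of at the current weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `overshoot` factor, the toy quadratic loss, and all hyperparameter values are assumptions made for illustration, and the update is a plain SGD-with-momentum variant rather than the exact algorithm from the paper.

```python
# Toy quadratic loss f(w) = 0.5 * sum(c_i * w_i^2); curvatures are illustrative.
CURVATURES = [1.0, 10.0]


def loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURES, w))


def grad(w):
    # Exact gradient of the toy loss above.
    return [c * wi for c, wi in zip(CURVATURES, w)]


def momentum_step(w, m, lr, mu, overshoot):
    """One SGD-with-momentum step with the gradient taken at shifted weights.

    `overshoot` (hypothetical name) scales how far the evaluation point is
    moved along the current momentum; overshoot = 0 gives classical momentum.
    """
    # Shift the weights in the direction the momentum would move them.
    shifted = [wi - lr * overshoot * mi for wi, mi in zip(w, m)]
    g = grad(shifted)
    # Aggregate the new gradient with the running momentum, then update.
    m = [mu * mi + gi for mi, gi in zip(m, g)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m


def run(overshoot, steps=100, lr=0.02, mu=0.9):
    # Optimize from a fixed start and return the final loss.
    w, m = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        w, m = momentum_step(w, m, lr, mu, overshoot)
    return loss(w)


if __name__ == "__main__":
    for factor in (0.0, 3.0):
        print(f"overshoot={factor}: final loss {run(factor):.3e}")
```

Only the evaluation point changes in this sketch; no extra state is stored beyond the weights and the momentum buffer. Behavior on a toy quadratic says nothing about the reported results on real tasks.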
Authors | Jakub Kopal, Michal Gregor, Santiago de Leon-Martinez, Jakub Simko |
Published | 2025-01-16 14:18:10+00:00 |
arXiv page | arxiv_id(pdf) |
Source and services used
arxiv.jp, Google