Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

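Summary

Adaptive gradient optimization methods such as Adam are widely used to train deep neural networks across diverse machine learning tasks because they converge faster.
However, these methods often generalize worse than stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models.
In this work, we identify the standard initialization of the second-order moment estimate ($v_0 = 0$) as a significant factor contributing to these limitations.
We introduce a simple yet effective remedy: initializing the second-order moment estimate with non-zero values, using either a data-driven or a random initialization strategy.
Empirical evaluations show that this approach not only stabilizes convergence but also improves the final performance of adaptive gradient optimizers.
Furthermore, with the proposed initialization strategies, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods.
Our code is available at https://github.com/Walleclipse/Adam_Initialization.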

Abstract (original)

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models. In this work, we show the standard initialization of the second-order moment estimation ($v_0 =0$) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also enhances the final performance of adaptive gradient optimizers. Furthermore, by adopting the proposed initialization strategies, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods. Our code is available at https://github.com/Walleclipse/Adam_Initialization.
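To make the role of $v_0$ concrete, the following is a minimal sketch of Adam's first update for a single scalar gradient. It is not the authors' implementation: the function name `adam_first_step` is hypothetical, the constant `v0=1.0` is an arbitrary stand-in for a non-zero initial value rather than the paper's data-driven or random strategy, and skipping the second-moment bias correction when `v0` is non-zero is an assumption of this sketch.

```python
import math


def adam_first_step(grad, v0=0.0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1) for a scalar gradient.

    Hypothetical helper for illustration; not taken from the authors' code.
    """
    # Moment updates starting from m_0 = 0 and a configurable v_0.
    m = (1 - beta1) * grad
    v = beta2 * v0 + (1 - beta2) * grad ** 2

    # Standard bias correction of the first moment at t = 1.
    m_hat = m / (1 - beta1)

    # Assumption of this sketch: the second-moment bias correction exists to
    # compensate for v_0 = 0, so it is applied only in that case.
    v_hat = v / (1 - beta2) if v0 == 0.0 else v

    return lr * m_hat / (math.sqrt(v_hat) + eps)


for g in (1e-4, 10.0):
    zero_init = adam_first_step(g)             # standard v_0 = 0
    nonzero_init = adam_first_step(g, v0=1.0)  # arbitrary non-zero v_0
    print(f"grad={g:g}  step(v0=0)={zero_init:.3e}  step(v0=1)={nonzero_init:.3e}")
```

With $v_0 = 0$, both gradients produce a first step of roughly `lr`, because the ratio of the corrected moments reduces to the sign of the gradient. With a non-zero $v_0$, the first step instead scales with the gradient magnitude.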

arXiv information

Authors	Abulikemu Abuduweili, Changliu Liu
Published	2025-02-11 16:23:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
