Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

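Summary

Adaptive gradient optimizers such as Adam(W) are the default training algorithms for many deep learning architectures, including transformers.
Their diagonal preconditioner is built from the gradient outer product, which enters the parameter update through a square root.
Although these methods are often motivated as approximate second-order methods, the square root marks a fundamental difference.
This work investigates how the behavior of adaptive methods changes when the root is removed, that is, when their second-order motivation is strengthened.
Surprisingly, such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures while maintaining the performance of their root-based counterparts on transformers.
The second-order perspective also has practical benefits for developing non-diagonal methods, which can incorporate arbitrary curvature approximations through the concept of preconditioner invariance.
In contrast to root-based methods such as Shampoo, the root-free counterparts work well and fast in half precision because they do not require numerically unstable matrix root decompositions and inversions.
Overall, the findings provide new insights into the development of adaptive methods and raise important questions about the overlooked role of adaptivity in their success.
(Experiment code: https://github.com/yorkerlin/remove-the-square-root; optimizer code: https://github.com/f-dangel/sirfshampoo)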

Summary (original)

Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e., strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterpart’s performance on transformers. The second-order perspective also has practical benefits for developing non-diagonal methods that can incorporate arbitrary curvature approximations through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, root-free counterparts work well and fast with half-precision since they do not require numerically unstable matrix root decompositions and inversions. Overall, our findings provide new insights into the development of adaptive methods and raise important questions regarding the overlooked role of adaptivity in their success. (experiment code: https://github.com/yorkerlin/remove-the-square-root optimizer code: https://github.com/f-dangel/sirfshampoo)
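
The role of the square root can be illustrated with a minimal sketch. It assumes an RMSProp-style diagonal second-moment estimate on plain Python floats; the function names, hyperparameters, and the toy gradient are hypothetical and chosen only for illustration. This is not the authors' implementation, which is available in the repositories linked above.

```python
import math


def adaptive_step(params, grads, state, lr=0.01, beta=0.9, eps=1e-8, use_root=True):
    """One RMSProp-style step on lists of floats (illustrative sketch only).

    state holds a running average of squared gradients, i.e. the diagonal
    of the gradient outer product. Returns (new_params, new_state).
    """
    # Exponential moving average of the squared gradient, per coordinate.
    new_state = [beta * s + (1.0 - beta) * g * g for s, g in zip(state, grads)]
    if use_root:
        # Root-based: divide by the square root of the second moment.
        denom = [math.sqrt(s) + eps for s in new_state]
    else:
        # Root-free: divide by the second moment itself.
        denom = [s + eps for s in new_state]
    new_params = [p - lr * g / d for p, g, d in zip(params, grads, denom)]
    return new_params, new_state


def first_step_size(loss_scale, use_root):
    """Size of the first update when the toy gradient is multiplied by loss_scale."""
    params, state = [1.0], [0.0]
    grads = [2.0 * loss_scale]  # hypothetical gradient of a rescaled toy loss
    new_params, _ = adaptive_step(params, grads, state, use_root=use_root)
    return abs(new_params[0] - params[0])


if __name__ == "__main__":
    for scale in (1.0, 10.0):
        print(scale, first_step_size(scale, True), first_step_size(scale, False))
```

In this sketch, multiplying the loss by 10 leaves the root-based step essentially unchanged, while the root-free step shrinks by a factor of 10, which is how a Newton-like step with a curvature-style preconditioner would respond. A practical root-free optimizer needs further care, for example in how the second-moment estimate is scaled and initialized; those details are not covered here.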

arXiv information

Authors	Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani
Published	2024-08-30 16:45:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
