Avoiding spurious sharpness minimization broadens applicability of SAM

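Summary

Curvature regularization techniques such as Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. In domains such as natural language processing (NLP), however, SAM performs poorly and often degrades performance, even with twice the compute budget. The authors investigate this discrepancy across domains and find that in the NLP setting SAM is dominated by regularization of the logit statistics rather than by improving the geometry of the function itself. Based on this observation they develop an alternative algorithm, Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network and so avoids spurious minimization through logit manipulation. They further argue that preconditioning the SAM perturbation also prevents spurious minimization, and that combining it with Functional-SAM gives further improvements. The proposed algorithms outperform AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). Overall, the work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).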

Abstract (original)

Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance — even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics — instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional-SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
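
For readers unfamiliar with the baseline, the following is a minimal sketch of a standard SAM update step on a toy quadratic loss. It illustrates only the generic two-gradient SAM procedure (perturb the weights along the normalized gradient, then descend using the gradient at the perturbed point). It is not the authors' Functional-SAM or the preconditioned variant, whose details the abstract does not give. The matrix `A`, the function names, and the values of `lr` and `rho` are all assumptions chosen for illustration.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w (hypothetical, for illustration only).
A = np.diag([1.0, 10.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    # Analytic gradient of the toy loss.
    return A @ w

def sam_perturbation(w, rho):
    # First-order approximation of the worst-case perturbation in an
    # L2 ball of radius rho: the gradient direction rescaled to norm rho.
    g = grad(w)
    return rho * g / (np.linalg.norm(g) + 1e-12)

def sam_step(w, lr=0.05, rho=0.05):
    # 1) Move to the perturbed point w + eps.
    eps = sam_perturbation(w, rho)
    # 2) Evaluate the gradient there.
    g_sam = grad(w + eps)
    # 3) Apply that gradient to the original weights.
    return w - lr * g_sam

w = np.array([1.0, 1.0])
w_next = sam_step(w)
print(loss(w), loss(w_next))
```

Each SAM step costs two gradient evaluations, so a SAM run with the same number of steps as a plain optimizer uses roughly twice the compute, which is the budget comparison the abstract refers to.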

arXiv information

Authors	Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin
Published	2025-02-04 15:25:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
