Grams: Gradient Descent with Adaptive Momentum Scaling for Training Large Language Models

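Summary

This paper introduces Gradient Descent with Adaptive Momentum Scaling (Grams), a new optimization algorithm that decouples the direction of a parameter update from its magnitude in deep learning.
Traditional optimizers fold momentum directly into the update. Grams instead takes the update direction from the current gradient and uses momentum only to scale the magnitude adaptively.
According to the authors, this lets Grams achieve better loss descent than state-of-the-art cautious and momentum-based optimizers.
On the theory side, they show that Grams descends faster than other state-of-the-art optimizers and establish a global convergence guarantee for it.
They also validate its effectiveness through extensive empirical evaluation.
The reported results show faster convergence and better generalization than widely used optimizers such as Adam, Lion, and their cautious variants.
The authors present these results as evidence of Grams' potential for training large language models efficiently.
Code is available at https://github.com/Gunale0926/Grams.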
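
The decoupling described above can be illustrated with a short sketch. The abstract does not spell out the update rule, so everything below is an assumption made for illustration: the magnitude is taken from an Adam-style ratio of bias-corrected moment estimates, the direction is the sign of the current gradient, and the function names (`grams_step`, `minimize_quadratic`) and hyperparameter defaults are hypothetical. The authors' repository remains the reference for the actual algorithm.

```python
import math


def grams_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Grams-style step on plain lists of floats (illustrative sketch).

    Assumption: the magnitude is |m_hat| / (sqrt(v_hat) + eps), as in Adam,
    and the direction is sign(g) of the *current* gradient.
    t is the 1-based step count used for bias correction.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        # Exponential moving averages of the gradient and its square.
        mi = beta1 * mi + (1.0 - beta1) * gi
        vi = beta2 * vi + (1.0 - beta2) * gi * gi
        # Bias-corrected estimates.
        m_hat = mi / (1.0 - beta1 ** t)
        v_hat = vi / (1.0 - beta2 ** t)
        # Momentum contributes only the size of the step...
        magnitude = abs(m_hat) / (math.sqrt(v_hat) + eps)
        # ...while the current gradient alone decides its sign.
        direction = 0.0 if gi == 0 else math.copysign(1.0, gi)
        new_w.append(wi - lr * direction * magnitude)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


def minimize_quadratic(w0, steps=200, lr=0.1):
    """Run the sketch on f(w) = sum(w_i^2), whose gradient is 2 * w."""
    w = list(w0)
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for t in range(1, steps + 1):
        g = [2.0 * wi for wi in w]
        w, m, v = grams_step(w, g, m, v, t, lr=lr)
    return w
```

The difference from a plain momentum update shows up when the accumulated momentum and the current gradient disagree in sign: in this sketch the parameter still moves against the current gradient, and the momentum only determines how far.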

Abstract (original)

We introduce $\mathbf{G}$radient Descent with $\mathbf{A}$daptive $\mathbf{M}$omentum $\mathbf{S}$caling ($\mathbf{Grams}$), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We theoretically demonstrate that Grams descents faster than other state-of-the-art optimizers and establish a global convergence guarantee for Grams. We also validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams’ superior performance, including faster convergence and better generalization, compared to widely-used optimizers such as Adam, Lion, and their cautious variants. Our results highlight Grams’ potential as a transformative approach for efficiently training large language models. Code is available at $\href{https://github.com/Gunale0926/Grams}{\text{https://github.com/Gunale0926/Grams}}$.

arXiv information

Authors	Yang Cao, Xiaoyu Li, Zhao Song
Published	2025-02-28 15:31:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
