Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

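Summary

Given the massive cost of pre-training language models, a non-trivial improvement to the optimization algorithm would substantially reduce training time and cost.
Adam and its variants have been the state of the art for years, while more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead.
This paper proposes Sophia (Second-order Clipped Stochastic Optimization), a simple, scalable second-order optimizer that uses a lightweight estimate of the diagonal Hessian as the pre-conditioner.
The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping.
Clipping controls the worst-case update size and tames the negative impact of non-convexity and of rapid changes in the Hessian along the trajectory.
Sophia estimates the diagonal Hessian only every handful of iterations, so the average per-step time and memory overhead is negligible.
On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up over Adam in the number of steps, total compute, and wall-clock time.
Theoretically, the authors show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous in language modeling tasks.
The run-time bound does not depend on the condition number of the loss.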

Abstract (original)

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time. Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the condition number of the loss.
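The update rule described in the abstract (gradient moving average divided by Hessian moving average, then element-wise clipping, with the Hessian refreshed only every few steps) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `lr`, `beta1`, `beta2`, `rho`, `eps`, and `k` are hypothetical, the `max(h, eps)` guard for non-positive curvature is an assumption, and the toy quadratic stands in for a real loss so that the exact diagonal Hessian can replace the paper's lightweight estimator.

```python
def sophia_like_step(theta, m, h, grad, hess_diag=None,
                     lr=0.1, beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12):
    """One Sophia-style update on lists of floats; returns (theta, m, h).

    All hyperparameter names and defaults are illustrative assumptions.
    """
    # Moving average of the gradients.
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    # Moving average of the diagonal Hessian, only when a fresh estimate exists.
    if hess_diag is not None:
        h = [beta2 * hi + (1.0 - beta2) * di for hi, di in zip(h, hess_diag)]
    # Pre-conditioned step, clipped element-wise to [-rho, rho].
    # Assumption: non-positive curvature is floored at eps, so the ratio
    # saturates and the step falls back to lr * rho * sign(m).
    update = [max(-rho, min(rho, mi / max(hi, eps))) for mi, hi in zip(m, h)]
    theta = [x - lr * u for x, u in zip(theta, update)]
    return theta, m, h


def run_toy_quadratic(curvature, theta0, steps=50, k=10, lr=0.1):
    """Minimise 0.5 * sum(a_i * x_i**2), whose exact diagonal Hessian is a_i."""
    theta = list(theta0)
    m = [0.0] * len(theta)
    h = [0.0] * len(theta)
    trajectory = [theta]
    for t in range(steps):
        grad = [a * x for a, x in zip(curvature, theta)]
        # Refresh the curvature estimate only every k steps (hypothetical k).
        hess_diag = list(curvature) if t % k == 0 else None
        theta, m, h = sophia_like_step(theta, m, h, grad, hess_diag, lr=lr)
        trajectory.append(theta)
    return trajectory


if __name__ == "__main__":
    # Curvatures spanning four orders of magnitude.
    path = run_toy_quadratic([100.0, 1.0, 0.01], [1.0, 1.0, 1.0])
    print(path[-1])
```

Because of the clipping, no coordinate moves by more than `lr * rho` in a single step, which is the worst-case bound the abstract refers to. In this toy quadratic the curvature also cancels in the ratio of the two moving averages, so steep and flat coordinates make similar progress; a real model would need a stochastic diagonal Hessian estimate in place of the exact one used here.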

arXiv information

Authors	Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Published	2023-05-23 17:59:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
