Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees

Summary

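We introduce two complementary techniques for efficient adaptive optimization that reduce memory requirements while accelerating the training of large-scale neural networks.
The first technique, the Subset-Norm adaptive step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) by reducing the memory footprint of the second-moment term from $O(d)$ to $O(\sqrt{d})$ through step-size sharing, where $d$ is the model size.
For non-convex smooth objectives under coordinate-wise sub-Gaussian gradient noise, we prove a noise-adapted, high-probability convergence guarantee that shows improved dimensional dependence over existing methods.
The second technique, Subspace-Momentum, reduces the memory footprint of the momentum state by operating in a low-dimensional subspace while applying standard SGD in the orthogonal complement.
We establish high-probability convergence rates under similarly relaxed assumptions.
Empirical evaluation on LLaMA models from 60M to 1B parameters demonstrates the effectiveness of our methods: combining Subset-Norm with Subspace-Momentum reaches Adam's validation perplexity in approximately half the training tokens (6.8B vs. 13.1B) while using only 20% of the memory footprint of Adam's optimizer states and requiring minimal additional hyperparameter tuning.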
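
The step-size sharing idea behind Subset-Norm can be illustrated with a small sketch. This is not the authors' implementation: the function name `subset_norm_step`, the contiguous equal-size partition, and the hyperparameters `lr` and `eps` are assumptions made for illustration. Each subset of coordinates keeps one accumulator of squared gradient norms, so `k` subsets need `k` accumulators instead of `d`. Choosing `k = 1` gives an AdaGrad-Norm style update and `k = d` a coordinate-wise AdaGrad style update.

```python
import math


def subset_norm_step(x, grad, acc, lr=0.5, eps=1e-8):
    """One AdaGrad-style step where each subset of coordinates shares a step size.

    x, grad: lists of length d.
    acc: list of k accumulators, one per subset (assumes d is divisible by k).
    Returns the updated (x, acc) as new lists.
    """
    d, k = len(x), len(acc)
    size = d // k  # coordinates per subset (contiguous blocks, an assumption)
    new_x, new_acc = list(x), list(acc)
    for i in range(k):
        block = range(i * size, (i + 1) * size)
        # Accumulate the squared norm of the gradient restricted to subset i.
        new_acc[i] += sum(grad[j] ** 2 for j in block)
        # All coordinates in subset i share this step size.
        step = lr / (math.sqrt(new_acc[i]) + eps)
        for j in block:
            new_x[j] -= step * grad[j]
    return new_x, new_acc


def run_demo(d=16, k=4, n_steps=50):
    """Minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself."""
    x = [1.0] * d
    acc = [0.0] * k  # k = sqrt(d) accumulators instead of d
    losses = [0.5 * sum(v * v for v in x)]
    for _ in range(n_steps):
        x, acc = subset_norm_step(x, list(x), acc)
        losses.append(0.5 * sum(v * v for v in x))
    return losses, acc
```

With `d = 16` and `k = 4`, the optimizer state holds 4 numbers rather than 16, which is the $O(\sqrt{d})$ versus $O(d)$ trade-off described above in miniature.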

Abstract (original)

We introduce two complementary techniques for efficient adaptive optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm adaptive step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) by reducing the second moment term’s memory footprint from $O(d)$ to $O(\sqrt{d})$ through step-size sharing, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian gradient noise, we prove a noise-adapted high-probability convergence guarantee showing improved dimensional dependence over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state’s memory footprint by operating in a low-dimensional subspace while applying standard SGD in the orthogonal complement. We establish high-probability convergence rates under similar relaxed assumptions. Empirical evaluation on LLaMA models from 60M to 1B parameters demonstrates the effectiveness of our methods, where combining subset-norm with subspace-momentum achieves Adam’s validation perplexity in approximately half the training tokens (6.8B vs 13.1B) while using only 20% of the Adam’s optimizer-states memory footprint and requiring minimal additional hyperparameter tuning.
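
As a rough illustration of the Subspace-Momentum idea, the sketch below keeps a momentum buffer only for the gradient component inside an `r`-dimensional subspace and applies a plain SGD step to the remainder in the orthogonal complement. The abstract does not specify how the subspace is chosen or the exact form of the momentum update, so the fixed orthonormal `basis`, the exponential-moving-average momentum, and the names `subspace_momentum_step`, `lr`, and `beta` are all assumptions, not the authors' method.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def subspace_momentum_step(x, grad, m, basis, lr=0.1, beta=0.9):
    """One step with momentum in a subspace and plain SGD in its complement.

    x, grad: lists of length d.
    basis: list of r orthonormal vectors in R^d (assumed given and fixed).
    m: momentum buffer of length r, stored in subspace coordinates.
    Returns the updated (x, m) as new lists.
    """
    # Coordinates of the gradient in the subspace (r numbers).
    coeffs = [dot(p, grad) for p in basis]
    # Exponential-moving-average momentum, kept only for the r coordinates.
    new_m = [beta * mi + (1 - beta) * c for mi, c in zip(m, coeffs)]
    new_x = []
    for j in range(len(x)):
        projected = sum(c * p[j] for c, p in zip(coeffs, basis))
        momentum = sum(mi * p[j] for mi, p in zip(new_m, basis))
        residual = grad[j] - projected  # component in the orthogonal complement
        new_x.append(x[j] - lr * (momentum + residual))
    return new_x, new_m


# Usage: a 4-dimensional problem with a 2-dimensional momentum subspace.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
x, m = [1.0, 1.0, 1.0, 1.0], [0.0, 0.0]
x, m = subspace_momentum_step(x, list(x), m, basis)  # gradient of 0.5 * ||x||^2
```

In this sketch the momentum state has `r` entries instead of `d`, which is where the memory saving would come from; the first two coordinates move with momentum while the last two take an ordinary SGD step.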

arXiv information

Authors	Thien Hang Nguyen, Huy Le Nguyen
Published	2024-11-11 16:48:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
