Adam-mini: Use Fewer Learning Rates To Gain More

Abstract

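We propose Adam-mini, an optimizer that matches or outperforms AdamW while using 45% to 50% less memory.
Adam-mini saves memory by cutting down the learning-rate resources in Adam (that is, $1/\sqrt{v}$).
We find that $\geq$ 90% of these learning rates in $v$ can be removed without harm if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure, and (2) assign a single but good learning rate to each parameter block.
We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search for it.
We then provide one cost-effective way to find good learning rates and propose Adam-mini.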
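
The idea of one learning rate per parameter block can be illustrated with a minimal sketch in plain Python. It rests on assumptions that the abstract does not spell out: each block's shared second-moment scalar is taken to be an exponential moving average of the mean squared gradient in that block, the block partition is a hypothetical hand-written list of index ranges rather than the Hessian-based partition, and the name `adam_mini_step` and all hyperparameter defaults are illustrative only.

```python
import math


def adam_mini_step(params, grads, m, v_block, blocks, step,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                   weight_decay=0.0):
    """One update step that keeps a single second-moment scalar per block.

    params, grads, m: flat lists of floats, one entry per parameter.
    v_block: list of floats, one entry per block (not per parameter).
    blocks: list of (start, end) index ranges; a hypothetical partition.
    step: 1-based step counter used for bias correction.
    """
    bc1 = 1.0 - beta1 ** step
    bc2 = 1.0 - beta2 ** step
    for b, (start, end) in enumerate(blocks):
        g_block = grads[start:end]
        # Assumed rule: one scalar per block, the mean of squared gradients.
        mean_sq = sum(g * g for g in g_block) / len(g_block)
        v_block[b] = beta2 * v_block[b] + (1.0 - beta2) * mean_sq
        # Every parameter in the block shares this denominator.
        denom = math.sqrt(v_block[b] / bc2) + eps
        for i in range(start, end):
            # The first moment stays per-parameter, as in Adam.
            m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i]
            # Decoupled weight decay, as in AdamW.
            params[i] -= lr * weight_decay * params[i]
            params[i] -= lr * (m[i] / bc1) / denom
    return params, m, v_block
```

In this sketch `v_block` holds one float per block, whereas Adam would hold one per parameter; that difference is where the memory saving would come from. A call such as `adam_mini_step(params, grads, m, v_block, [(0, 2), (2, 4)], step=1)` updates four parameters using only two second-moment values.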
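Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models ranging from 125M to 7B parameters for pre-training, supervised fine-tuning, and RLHF.
The reduced memory footprint of Adam-mini also alleviates communication overhead among GPUs and CPUs, thereby increasing throughput.
For instance, when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, Adam-mini achieves 49.6% higher throughput than AdamW, which saves 33% of the wall-clock time for pre-training.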
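
The quoted figures can be sanity-checked with simple arithmetic. The sketch below reads the 45% to 50% figure as a saving in optimizer state, and assumes that this state consists of two equally sized buffers per parameter (Adam's first moment $m$ and second moment $v$) and that wall-clock time for a fixed workload is inversely proportional to throughput. Both helper names are hypothetical.

```python
def optimizer_state_saving(fraction_of_v_removed):
    """Fraction of Adam's optimizer state (m and v) saved when part of v is dropped.

    Assumes m and v have the same size, so v is half of the state.
    """
    return fraction_of_v_removed / 2.0


def wall_clock_saving(throughput_gain):
    """Fraction of wall-clock time saved for a fixed workload.

    Assumes time is inversely proportional to throughput.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


print(f"{optimizer_state_saving(0.90):.1%}")  # 45.0%
print(f"{wall_clock_saving(0.496):.1%}")      # 33.2%
```

Under these assumptions, removing 90% of $v$ saves 45% of the optimizer state, and removing nearly all of it approaches 50%. A 49.6% throughput gain corresponds to roughly a 33% reduction in wall-clock time.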

Abstract (original)

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq$ 90% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.

arXiv information

Authors	Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
Published	2024-06-26 13:03:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
