Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

Summary

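The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems.
Self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, but existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge.
Recent work has adopted a repeated cosine annealing schedule for large-scale continual pre-training.
However, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared with existing continual SSL methods.
This work systematically compares the widely used cosine schedule with the recently proposed infinite learning rate schedule and finds empirically that the latter is a more effective alternative.
An extensive empirical evaluation across diverse image and language datasets shows that the infinite learning rate schedule consistently improves continual pre-training performance over repeated cosine decay, without being restricted to a fixed iteration budget.
For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature.
The experiments are then scaled up to larger MAE pre-training and to autoregressive language model pre-training.
The results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay on both MAE pre-training and zero-shot LM benchmarks.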
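The abstract does not give the functional form of either schedule, so the following is only a minimal sketch of the contrast it describes, assuming the shape usually attributed to infinite schedules: a single warmup, a cooldown to a constant rate that is then held for as long as new data arrives, and an optional short anneal before evaluating a checkpoint. All function names, parameters, and values here are illustrative assumptions, not the authors' implementation.

```python
import math


def repeated_cosine_lr(step, task_steps, warmup_steps, max_lr, min_lr):
    """Cosine decay restarted for every task, so each task re-warms to max_lr."""
    t = step % task_steps  # position inside the current task
    if t < warmup_steps:
        # Linear re-warming from 0 up to max_lr at the start of each task.
        return max_lr * t / warmup_steps
    progress = (t - warmup_steps) / max(1, task_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def infinite_lr(step, warmup_steps, cooldown_steps, max_lr, const_lr):
    """Warm up once, cool down to const_lr, then hold it with no fixed budget."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    t = step - warmup_steps
    if t < cooldown_steps:
        # Cosine-shaped cooldown (assumed; other cooldown shapes are possible).
        progress = t / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * progress))
    return const_lr  # constant phase: no re-warming when a new task starts


def anneal_lr(steps_into_anneal, anneal_steps, const_lr, min_lr):
    """Optional exponential decay from const_lr to min_lr before evaluation."""
    frac = min(1.0, steps_into_anneal / anneal_steps)
    return const_lr * (min_lr / const_lr) ** frac


if __name__ == "__main__":
    # Hypothetical settings: two tasks of 1000 steps each.
    for step in (0, 100, 999, 1000, 1100, 1999):
        cos_lr = repeated_cosine_lr(step, 1000, 100, 1e-3, 1e-5)
        inf_lr = infinite_lr(step, 100, 400, 1e-3, 2.5e-4)
        print(f"step={step:5d}  cosine={cos_lr:.2e}  infinite={inf_lr:.2e}")
```

In this sketch the repeated cosine schedule jumps back toward `max_lr` at the start of the second task, which is the re-warming the abstract associates with forgetting, while the infinite schedule stays at `const_lr` across the task boundary.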

Abstract (original)

The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

arXiv information

Authors	Paul Janson, Vaibhav Singh, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Thérien
Published	2025-03-04 18:15:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
