CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

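Summary

Large language models (LLMs) excel at a wide range of tasks but often underperform in specialized fields because domain-specific or proprietary corpora are limited.
Continual pre-training (CPT) strengthens an LLM by injecting new domain-specific or proprietary knowledge while replaying general corpus data to prevent catastrophic forgetting.
However, the mixture ratio between general and domain-specific data has so far been chosen heuristically, which leads to suboptimal training efficiency in practice.
In this context, we revisit the scaling behavior of LLMs under CPT and find a power-law relationship between loss, mixture ratio, and the number of training tokens.
We formalize the trade-off between general and domain-specific capabilities, which yields a well-defined Critical Mixture Ratio (CMR) of general and domain data.
By striking this balance, the CMR preserves the model's general ability while achieving the desired domain transfer, making the fullest use of the available resources.
Therefore, when the balance between efficiency and effectiveness is what matters, the CMR can be regarded as the optimal mixture ratio. Through extensive experiments, we confirm that the CMR is predictable, propose the CMR scaling law, and demonstrate its generality.
These findings offer practical guidelines for optimizing LLM training in specialized domains, securing both general and domain-specific performance while managing training resources efficiently.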

Abstract (original)

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we attempt to re-visit the scaling behavior of LLMs under the hood of CPT, and discover a power-law relationship between loss, mixture ratio, and training tokens scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking the balance, CMR maintains the model’s general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Therefore, if we value the balance between efficiency and effectiveness, CMR can be considered the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose the CMR scaling law, and substantiate its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.
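
The abstract does not give the fitted functional form, so the following is only a minimal sketch of the two ideas it names. It assumes a plain power law, loss ≈ a * tokens^(-b), fitted in log-log space, and it reads the CMR as the largest domain-data ratio whose general-loss increase stays within a chosen tolerance. The function names, the tolerance, and all numbers are hypothetical and synthetic; none of them come from the paper.

```python
import numpy as np


def fit_power_law(tokens, losses):
    """Fit loss ~ a * tokens**(-b) by least squares in log-log space.

    Assumed form for illustration only; the paper's actual law may differ.
    """
    slope, intercept = np.polyfit(np.log(tokens), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)


def critical_mixture_ratio(ratios, general_loss_increase, tolerance):
    """Return the largest domain-data ratio whose general-loss increase
    is within `tolerance`, or None if no ratio is feasible.

    This is one plausible reading of "critical mixture ratio",
    not a definition taken from the source.
    """
    feasible = [r for r, d in zip(ratios, general_loss_increase) if d <= tolerance]
    return max(feasible) if feasible else None


# Synthetic domain-loss curve generated from a known power law.
tokens = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9])
losses = 50.0 * tokens ** -0.15
a, b = fit_power_law(tokens, losses)

# Synthetic general-loss increases for several domain-data ratios.
ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
general_loss_increase = [0.00, 0.01, 0.02, 0.06, 0.15]
cmr = critical_mixture_ratio(ratios, general_loss_increase, tolerance=0.03)
```

In this toy setup the fit recovers the generating coefficients, and the selected ratio is the highest one that keeps the general-loss increase under the tolerance. With real measurements, both the functional form and the tolerance would have to be validated against held-out runs.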

arXiv information

Authors	Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan
Published	2024-07-24 17:59:02+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
