To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

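Summary

Recent research has emphasized the importance of dataset size when scaling language models.
However, large language models (LLMs) are known to consume huge numbers of tokens during pre-training, and the supply of high-quality text data on the web is approaching the limit of what LLM scaling requires.
A simple way to strengthen LLMs further is to repeat the pre-training data for additional epochs.
This study empirically investigates three key aspects of that approach.
First, it examines the consequences of repeating pre-training data and shows that the model is prone to overfitting, which leads to multi-epoch degradation.
Second, it examines the main factors behind multi-epoch degradation: the significant factors include dataset size, model parameters, and training objectives, while the less influential factors include dataset quality and model FLOPs.
Finally, it considers whether widely used regularization can mitigate multi-epoch degradation.
Most regularization techniques bring no significant improvement. The exception is dropout, which is remarkably effective but requires careful tuning as the model size is scaled up.
In addition, the authors find that mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with a comparable number of trainable parameters, which may influence efficient LLM development on a broader scale.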

Abstract (original)

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.
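As a rough illustration of what "repeating the pre-training data for additional epochs" means in practice, the sketch below computes how many passes over a fixed dataset are needed to reach a given training token budget, and then iterates over the same examples once per epoch with a fresh shuffle. It is a minimal sketch and not the authors' training setup: the function names `epochs_needed` and `repeated_batches`, the token counts, and the per-epoch reshuffling are all assumptions made for this example.

```python
import random


def epochs_needed(unique_tokens: int, token_budget: int) -> int:
    """Number of passes over the data needed to consume the token budget.

    Both arguments are hypothetical quantities chosen by the caller.
    """
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return (token_budget + unique_tokens - 1) // unique_tokens


def repeated_batches(examples, num_epochs, batch_size, seed=0):
    """Yield (epoch, batch) pairs, revisiting every example once per epoch.

    Reshuffling each epoch is an assumption of this sketch; the source
    does not say how repeated data is ordered.
    """
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        order = list(examples)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield epoch, order[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical numbers: 100 unique tokens, a budget of 250 tokens.
    n_epochs = epochs_needed(unique_tokens=100, token_budget=250)
    for epoch, batch in repeated_batches(range(10), n_epochs, batch_size=4):
        print(epoch, batch)
```

From the second epoch onward, every batch consists of examples the model has already seen, which is the setting in which the abstract reports overfitting and multi-epoch degradation.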

arXiv information

Authors	Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You
Published	2023-05-22 17:02:15+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
