Establishing Task Scaling Laws via Compute-Efficient Model Ladders

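Summary

We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting.
Standard power laws for language modeling loss cannot accurately model task performance.
We therefore use a two-step prediction approach: first, model size and data size are used to predict a task-specific loss; second, this task loss is used to predict task performance.
We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens.
Training the ladder models costs only 1% of the compute used for the target models.
On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error.
Prediction error is higher on four other tasks (average absolute error 6.9), and we find that these are often tasks with higher variance in task metrics.
We also find that using less compute to train fewer ladder models tends to worsen predictions.
Finally, we show empirically that our design choices and the two-step approach lead to superior performance in establishing scaling laws.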

Summary (original)

We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale ‘ladder’ models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
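The two-step approach described above can be sketched as a composition of two fitted functions. The abstract does not state the functional forms, so this sketch assumes a power law in model size and data size for step 1 and a sigmoid from task loss to accuracy for step 2. All names (`LossParams`, `AccParams`, `two_step_predict`) and all parameter values are hypothetical placeholders chosen for illustration, not fitted values from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class LossParams:
    # Assumed step-1 form: L(N, D) = A / N**alpha + B / D**beta + E
    A: float
    alpha: float
    B: float
    beta: float
    E: float


@dataclass
class AccParams:
    # Assumed step-2 form: Acc(L) = a / (1 + exp(-k * (L - L0))) + b
    a: float
    k: float
    L0: float
    b: float


def predict_task_loss(n_params: float, n_tokens: float, p: LossParams) -> float:
    """Step 1: predict task-specific loss from model size and data size."""
    return p.A / n_params ** p.alpha + p.B / n_tokens ** p.beta + p.E


def predict_accuracy(task_loss: float, p: AccParams) -> float:
    """Step 2: map task loss to task accuracy with a sigmoid."""
    return p.a / (1.0 + math.exp(-p.k * (task_loss - p.L0))) + p.b


def two_step_predict(n_params: float, n_tokens: float,
                     lp: LossParams, ap: AccParams) -> float:
    """Chain both steps: (N, D) -> task loss -> accuracy."""
    return predict_accuracy(predict_task_loss(n_params, n_tokens, lp), ap)


# Hypothetical parameters. In practice they would be fitted to data points
# collected from the small ladder models.
LOSS_PARAMS = LossParams(A=400.0, alpha=0.3, B=400.0, beta=0.3, E=0.5)
# With a < 0, accuracy falls from b (low loss) toward a + b (high loss);
# a + b = 0.25 corresponds to chance on a 4-way multiple-choice task.
ACC_PARAMS = AccParams(a=-0.75, k=3.0, L0=1.2, b=1.0)

# The two target configurations named in the abstract.
for name, n, d in [("7B-4T", 7e9, 4e12), ("13B-5T", 13e9, 5e12)]:
    acc = two_step_predict(n, d, LOSS_PARAMS, ACC_PARAMS)
    print(f"{name}: predicted accuracy {acc:.3f}")
```

Splitting the prediction this way lets each step be fitted and checked separately: step 1 against the ladder models' measured task losses, and step 2 against their measured accuracies.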

arXiv information

Authors	Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
Published	2024-12-05 18:21:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
