Rho-1: Not All Tokens Are What You Need

Summary

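Previous language model pre-training methods apply a next-token prediction loss uniformly to all training tokens.
Challenging this norm, we posit that "not all tokens in a corpus are equally important for language model training."
Our initial analysis examines the token-level training dynamics of language models and reveals distinct loss patterns for different tokens.
Building on these insights, we introduce a new language model called Rho-1.
Unlike traditional LMs, which learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which trains selectively on useful tokens that align with the desired distribution.
This approach scores pretraining tokens with a reference model and then trains the language model with a loss focused on tokens with higher excess loss.
When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks.
After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
Furthermore, when pretrained on 80B general tokens, Rho-1 achieves an average improvement of 6.8% across 15 diverse tasks, improving both the efficiency and the performance of language model pre-training.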
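
The token selection step described in the summary can be illustrated with a small sketch. It assumes that "excess loss" means the training model's per-token loss minus the reference model's per-token loss, and that a fixed fraction of the highest-scoring tokens is kept; neither detail is spelled out in the summary. The function name `select_tokens_by_excess_loss` and the parameter `keep_ratio` are hypothetical, and the loss values are made-up numbers rather than results from the paper.

```python
from typing import List, Tuple


def select_tokens_by_excess_loss(
    train_losses: List[float],
    ref_losses: List[float],
    keep_ratio: float,
) -> Tuple[List[int], float]:
    """Pick the tokens with the highest excess loss and average the
    training loss over only those tokens.

    Assumption (not stated in the summary): excess loss is the training
    model's per-token loss minus the reference model's per-token loss,
    and the top `keep_ratio` fraction of tokens is kept.
    """
    if len(train_losses) != len(ref_losses):
        raise ValueError("both loss lists must cover the same tokens")
    if not train_losses:
        raise ValueError("at least one token is required")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Score every token position by its excess loss.
    excess = [t - r for t, r in zip(train_losses, ref_losses)]

    # Keep at least one token, otherwise the top fraction by score.
    k = max(1, int(len(excess) * keep_ratio))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    selected = sorted(ranked[:k])

    # The focused loss ignores every token that was not selected.
    focused_loss = sum(train_losses[i] for i in selected) / k
    return selected, focused_loss


if __name__ == "__main__":
    # Illustrative per-token losses for a four-token sequence.
    train = [2.0, 0.5, 3.0, 1.0]
    ref = [0.5, 0.4, 2.9, 0.2]
    positions, loss = select_tokens_by_excess_loss(train, ref, keep_ratio=0.5)
    print(positions, loss)  # [0, 3] 1.5
```

In this toy input the third token has the highest raw loss (3.0) but is left out, because the reference model finds it almost as hard (2.9). Ranking by excess loss rather than raw loss is what lets the selection skip tokens that are hard for both models.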

Summary (original)

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that ‘Not all tokens in a corpus are equally important for language model training’. Our initial analysis delves into token-level training dynamics of language model, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that aligned with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continual pretraining on 15B OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively – matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves 6.8% average enhancement across 15 diverse tasks, increasing both efficiency and performance of the language model pre-training.

arXiv information

Authors	Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
Published	2024-04-11 17:52:01+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
