Better & Faster Large Language Models via Multi-token Prediction

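Summary

Large language models such as GPT and Llama are trained using a next-token prediction loss.
This work suggests that training language models to predict several future tokens at once leads to higher sample efficiency.
More specifically, at each position in the training corpus, the model is asked to predict the next n tokens using n independent output heads that operate on top of a shared model trunk.
Treating multi-token prediction as an auxiliary training task, the authors measure improved downstream capabilities, with no overhead in training time, for both code and natural language models.
The method becomes increasingly useful as model size grows, and it keeps its appeal when training for multiple epochs.
The gains are especially pronounced on generative benchmarks such as coding, where the authors' models consistently outperform strong baselines by several percentage points.
Their 13B parameter model solves 12 % more problems on HumanEval and 17 % more on MBPP than comparable next-token models.
Experiments on small algorithmic tasks demonstrate that multi-token prediction favors the development of induction heads and algorithmic reasoning capabilities.
As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.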
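The training setup described above (n output heads on a shared trunk, each predicting one of the next n tokens) can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, the trunk is not modeled at all (per-head logits are simply passed in), and summing one cross-entropy term per head is an assumption about how the per-head losses are combined, which the abstract does not specify.

```python
import math

def multi_token_targets(tokens, n):
    """For each position t, return the next n tokens (t+1 .. t+n).
    Positions that do not have n future tokens are dropped."""
    return [tokens[t + 1 : t + 1 + n] for t in range(len(tokens) - n)]

def cross_entropy(logits, target):
    """Negative log-softmax probability of `target` under `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def multi_token_loss(head_logits, targets):
    """Sum the cross-entropy of every head, then average over positions.

    head_logits[t][i] is the logit vector produced by head i at position t
    (in a real model, each head would read the shared trunk's output).
    targets[t][i] is the token head i should predict, i.e. token t+1+i.
    Assumption: heads are combined by a plain unweighted sum.
    """
    total = 0.0
    for logits_t, targets_t in zip(head_logits, targets):
        for logits, target in zip(logits_t, targets_t):
            total += cross_entropy(logits, target)
    return total / len(targets)

# Toy usage: vocabulary of 4 tokens, n = 2 heads, uniform (all-zero) logits.
tokens = [3, 1, 0, 2, 3]
n, vocab_size = 2, 4
targets = multi_token_targets(tokens, n)
head_logits = [[[0.0] * vocab_size for _ in range(n)] for _ in targets]
loss = multi_token_loss(head_logits, targets)
```

With uniform logits each head contributes log(4) per position, so the toy loss is 2 * log(4). Setting n = 1 reduces the sketch to the ordinary next-token loss, which is the baseline the summary compares against.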

Abstract (original)

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12 % more problems on HumanEval and 17 % more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.

arXiv information

Authors	Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
Published	2024-04-30 17:33:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
