On multi-token prediction for efficient LLM inference

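Summary

We systematically investigate the multi-token prediction (MTP) capabilities of LLMs that were pre-trained for next-token prediction (NTP).
First, we show that such models inherently have MTP capability via numerical marginalization over intermediate token probabilities, although performance is data-dependent and improves with model scale.
We then examine the challenge of integrating MTP heads into a frozen LLM and find that its hidden layers are strongly specialized for NTP, which makes adaptation non-trivial.
Finally, we show that jointly training the MTP heads with the backbone improves performance but does not fully overcome this barrier, which calls for further research in this direction.
Our findings give a deeper understanding of MTP applied to pretrained LLMs and inform strategies for accelerating inference through parallel token prediction.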

Abstract (original)

We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
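
The marginalization idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard identity p(x_{t+2} | context) = Σ_v p(x_{t+2} | context, v) · p(v | context), summed over every candidate intermediate token v. The three-token vocabulary, the `TOY_NTP` table, and the function names are hypothetical stand-ins for a real next-token model, and the abstract does not say whether the authors compute the sum exactly or approximate it.

```python
from typing import Dict, List

# Hypothetical toy vocabulary and next-token table (not from the paper).
# The "model" here only looks at the last token of the context.
VOCAB = ["a", "b", "c"]
TOY_NTP: Dict[str, Dict[str, float]] = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}


def next_token_probs(context: List[str]) -> Dict[str, float]:
    """Stand-in for an NTP model: returns p(next token | context)."""
    return TOY_NTP[context[-1]]


def second_token_probs(context: List[str]) -> Dict[str, float]:
    """Return p(token two steps ahead | context) by marginalizing
    over every possible intermediate token."""
    result = {tok: 0.0 for tok in VOCAB}
    first_step = next_token_probs(context)
    for mid_tok, p_mid in first_step.items():
        # Condition on the candidate intermediate token, then weight
        # the resulting distribution by that token's probability.
        second_step = next_token_probs(context + [mid_tok])
        for tok, p_tok in second_step.items():
            result[tok] += p_mid * p_tok
    return result


probs = second_token_probs(["a"])
print(probs)
```

In this toy setting the sum needs one extra model call per vocabulary entry, so the cost of exact marginalization grows with vocabulary size.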

arXiv information

Authors	Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
Published	2025-02-13 15:42:44+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
