EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Summary

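Because modern LLMs generate tokens sequentially, they are expensive and slow to run, and speculative sampling has proven to be an effective remedy.
Methods such as EAGLE perform autoregression at the feature level, reusing the target model's top-layer features to achieve better results than vanilla speculative sampling.
A growing trend in the LLM community is to scale up training data in order to improve model intelligence without increasing inference cost.
However, the authors observe that scaling up data yields only limited improvements for EAGLE.
They trace this limitation to the constraints imposed by EAGLE's feature prediction.
The paper introduces EAGLE-3, which abandons feature prediction in favor of direct token prediction and, via a technique named training-time test, replaces the reliance on top-layer features with multi-layer feature fusion.
These changes significantly improve performance and allow the draft model to benefit fully from scaling up training data.
The experiments cover both chat models and reasoning models, evaluated on five tasks.
The results show that EAGLE-3 achieves a speedup ratio of up to 6.5x, about a 1.4x improvement over EAGLE-2.
In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.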

Summary (original)

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE’s feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
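The abstract takes vanilla speculative sampling as its baseline without spelling out the mechanism. As a rough illustration only, the sketch below shows the generic draft-then-verify acceptance rule commonly used in speculative sampling: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the renormalized residual max(0, p - q). This is not EAGLE-3's code; the function names `verify_draft_token` and `verify_draft` and the dictionary-based distributions are hypothetical choices made for this example, and the bonus token normally sampled after a fully accepted draft is omitted for brevity.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Accept or replace one drafted token.

    p_target and q_draft map token -> probability under the target and
    draft models at the same position (hypothetical toy representation).
    Returns (accepted, final_token).
    """
    p = p_target.get(token, 0.0)
    q = q_draft.get(token, 0.0)
    # Accept the drafted token with probability min(1, p / q).
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, pt - q_draft.get(t, 0.0)) for t, pt in p_target.items()}
    total = sum(residual.values())
    if total == 0.0:
        # Degenerate case: fall back to the target distribution itself.
        residual, total = dict(p_target), sum(p_target.values())
    r = rng.random() * total
    cumulative = 0.0
    for t, weight in residual.items():
        cumulative += weight
        if r < cumulative:
            return False, t
    return False, t  # guards against floating-point round-off


def verify_draft(draft_tokens, p_list, q_list, rng):
    """Verify drafted tokens left to right; stop at the first rejection."""
    output = []
    for token, p_target, q_draft in zip(draft_tokens, p_list, q_list):
        accepted, final = verify_draft_token(token, p_target, q_draft, rng)
        output.append(final)
        if not accepted:
            break
    return output
```

With this rule the emitted tokens follow the target model's distribution even though the draft model proposed them, which is why the speedups reported in the abstract come without a change in output distribution; the better the draft matches the target, the more tokens are accepted per verification step.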

arXiv information

Authors	Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
Published	2025-04-23 07:08:17+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
