Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff

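Abstract

Speculative decoding (SD) enforces strict distributional equivalence with the target model when accepting candidate tokens. This preserves the target model's generation quality, but it also limits the speedup SD can achieve and prevents users from accepting some deviation from the target distribution in exchange for faster inference. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD), a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergence between the target and draft model distributions. By allowing controlled divergence from the target model, FSD lets users flexibly trade generation quality for inference speed. On several benchmarks, our method runs more than 5 tokens per second faster than SD while reducing benchmark accuracy by only about 2% (absolute). In many cases, FSD even matches SD's benchmark accuracy while running more than 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD integrates seamlessly into existing SD extensions: we demonstrate this by applying FSD to EAGLE-2, which greatly improves that extension's efficiency and lets it use FSD's tunable quality-speed trade-off.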

Abstract (original)

Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While it maintains the target model’s generation quality, this strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution in exchange for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD) – a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method is able to achieve significant runtime improvements of over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD is even able to match SD benchmark accuracy at over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing this existing extension’s efficiency while allowing it to leverage FSD’s tunable quality-speed trade-off.
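
The acceptance idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which divergence FSD uses or what happens when the divergence is too large, so everything specific here is an assumption made for illustration: the Jensen-Shannon divergence, the `threshold` parameter, the fallback to the standard SD acceptance test, and the function names `js_divergence` and `fuzzy_accept`. It is not the authors' implementation.

```python
import math
import random


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Assumption: the abstract does not name a divergence; JS is used here
    only because it is symmetric and bounded by ln(2).
    """
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fuzzy_accept(draft_tokens, draft_dists, target_dists, threshold, rng=random):
    """Return the accepted prefix of draft_tokens (hypothetical rule).

    For each position, q is the draft distribution and p the target
    distribution over the vocabulary.
    - If divergence(p, q) <= threshold, accept the candidate outright.
    - Otherwise fall back to the standard SD test: accept with
      probability min(1, p[token] / q[token]).
    Resampling a replacement token after a rejection is omitted for brevity.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_dists, target_dists):
        if js_divergence(p, q) <= threshold:
            accepted.append(tok)
            continue
        # q[tok] > 0 is assumed because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

In this sketch a negative `threshold` disables the fuzzy path, so every candidate goes through the standard SD test, while a `threshold` above ln(2) accepts every candidate. Values in between give the kind of tunable quality-speed trade-off the abstract describes.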

arXiv information

Authors	Maximilian Holsman, Yukun Huang, Bhuwan Dhingra
Published	2025-06-03 16:08:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
