Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

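Abstract

Striking the best balance between minimal drafting latency and high speculation accuracy, in order to speed up large language model inference, remains a key challenge in speculative decoding.
This paper introduces Falcon, a semi-autoregressive speculative decoding framework designed to improve both the drafter's parallelism and its output quality.
Falcon incorporates the Coupled Sequential Glancing Distillation technique, which strengthens inter-token dependencies within the same block and thereby raises speculation accuracy.
A comprehensive theoretical analysis is provided to shed light on the underlying mechanism.
In addition, a Custom-Designed Decoding Tree is introduced. It lets the drafter generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, which increases the number of drafted tokens and significantly improves the overall acceptance rate.
Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration capability.
The framework achieves lossless speedup ratios ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series.
These results surpass existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while keeping a compact drafter architecture equivalent to only two Transformer layers.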

Abstract (original)

Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semi-autoregressive speculative decoding framework fashioned to augment both the drafter’s parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon’s superior acceleration capabilities. The framework achieves a lossless speedup ratio ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series. These results outstrip existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to merely two Transformer layers.
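
To make the draft-then-verify idea in the abstract concrete, the following is a minimal sketch of generic greedy speculative decoding with a block drafter. It is not Falcon's algorithm: it has no decoding tree and no distillation, and every name in it (`speculative_decode`, `toy_target`, `toy_drafter`, the parameter `k`) is a hypothetical chosen for illustration. The drafter here proposes `k` tokens per call, which loosely mirrors semi-autoregressive drafting, and the target model checks them one by one. A real system would score all drafted positions in a single batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interfaces, for illustration only.
TargetFn = Callable[[List[Token]], Token]              # greedy next token of the target model
DrafterFn = Callable[[List[Token], int], List[Token]]  # k drafted tokens from one drafter call


def greedy_decode(target: TargetFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Plain autoregressive decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    target: TargetFn,
    drafter: DrafterFn,
    prompt: List[Token],
    k: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the longest prefix the target agrees with.

    Returns the decoded sequence and the number of draft/verify rounds.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = drafter(out, k)
        rounds += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok != expected:
                # First mismatch: take the target's token and discard the rest of the draft.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Whole draft accepted: the verification also yields one extra target token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], rounds


def toy_target(prefix: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token], k: int) -> List[Token]:
    """Toy drafter: guesses the next k tokens at once, but always gets token 5 wrong."""
    block = [(prefix[-1] + i) % 10 for i in range(1, k + 1)]
    return [0 if t == 5 else t for t in block]


if __name__ == "__main__":
    reference = greedy_decode(toy_target, [0], 12)
    tokens, rounds = speculative_decode(toy_target, toy_drafter, [0], k=3, max_new_tokens=12)
    print(tokens == reference, rounds)
```

Because every emitted token is either confirmed or supplied by the target model, the output matches plain greedy decoding, which is the sense in which such schemes are called lossless. The number of rounds only hints at the possible speedup; actual gains also depend on how cheap the drafter is relative to the target model.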

arXiv information

Authors	Xiangxiang Gao,Weisheng Xie,Yiwei Xiang,Feng Ji
Published	2025-04-22 07:32:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
