Cascade Speculative Drafting for Even Faster LLM Inference

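Summary

Speculative decoding improves the efficiency of large language models (LLMs) by having a draft model produce drafts for a larger target model to review.
However, drafting in speculative decoding involves slow autoregressive generation, and it spends the same amount of time generating tokens of differing importance.
These two inefficiencies lead to suboptimal performance.
To address this, the authors introduce Cascade Speculative Drafting (CS. Drafting), a new approach that uses two types of cascades.
The Vertical Cascade eliminates autoregressive generation from neural models.
The Horizontal Cascade allocates drafting time efficiently, with its optimality supported by theoretical analysis.
Combining both cascades, the CS. Drafting algorithm achieved up to 72 percent additional speedup over speculative decoding in experiments while preserving the same output distribution.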

Summary (original)

Speculative decoding enhances the efficiency of large language models (LLMs) by leveraging a draft model to draft for a larger target model to review. However, drafting in speculative decoding involves slow autoregressive generation and generating tokens of different importance with the same time allocation. These two inefficiencies lead to its suboptimal performance. To address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade constitutes efficient time allocation in drafting with its optimality supported by our theoretical analysis. Combining both cascades, our CS. Drafting algorithm has achieved up to 72 percent additional speedup over speculative decoding in our experiments while keeping the same output distribution.
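
The abstract's claim of "keeping the same output distribution" rests on the review step of speculative decoding. As a rough illustration, the sketch below shows the standard single-token accept/reject rule from the speculative sampling literature, not the CS. Drafting algorithm itself. The function names (`residual_distribution`, `verify_draft_token`, `empirical_output`) and the toy three-token distributions are assumptions made for this example; `p` stands in for the target model's next-token distribution and `q` for the draft model's.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p == q: a rejection cannot occur, so fall back to p itself.
        return list(p)
    return [r / total for r in raw]


def verify_draft_token(p, q, x, rng):
    """Accept draft token x with probability min(1, p[x] / q[x]).

    On rejection, draw a replacement from the residual distribution.
    Assumes q[x] > 0, which holds whenever x was sampled from q.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0]


def empirical_output(p, q, n, seed=0):
    """Draft from q, verify against p, and return observed frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        x = rng.choices(range(len(q)), weights=q)[0]  # toy draft step
        counts[verify_draft_token(p, q, x, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]  # hypothetical target distribution p
    draft = [0.2, 0.5, 0.3]   # hypothetical draft distribution q
    print(empirical_output(target, draft, 20000))
```

With enough samples the printed frequencies should sit close to `target` rather than `draft`, which is the sense in which the reviewed output follows the target model's distribution even though the tokens were proposed by a different model.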

arXiv information

Authors	Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
Published	2023-12-21 18:46:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
