Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

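Summary

Large language models (LLMs) perform strongly across many real-world tasks.
However, their autoregressive nature makes inference slow and costly.
Speculative decoding has emerged as a promising remedy: a smaller auxiliary model drafts future tokens, which the larger model then verifies in parallel, yielding a 1-2x speed-up.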
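
The accept-or-resample rule behind that verification step can be sketched as follows. This is a minimal illustration of standard speculative sampling for a single drafted token, not the DSBD scheme itself; the function name `verify_draft_token` and the dict-based probability tables are assumptions made for the example.

```python
import random


def verify_draft_token(token, p_large, q_small, rng=random):
    """Accept or replace one token drafted by the small model.

    p_large, q_small: dicts mapping token -> probability at this position
    under the large and small model (hypothetical inputs for illustration).
    `token` is assumed to have been sampled from q_small.
    Returns (accepted, output_token).
    """
    p = p_large.get(token, 0.0)
    q = q_small[token]
    # Accept the draft with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the residual max(0, p - q), renormalized.
    # A rejection implies p < q for this token, so the residual mass is > 0.
    residual = {t: max(0.0, p_large[t] - q_small.get(t, 0.0)) for t in p_large}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]
```

Under this rule the emitted token is distributed according to `p_large`, which is why the output distribution matches plain sampling from the large model.
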
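Speculative decoding reproduces the output distribution of multinomial sampling, but multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step.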
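
One step of beam sampling can be sketched as below. The sketch assumes one common formulation, in which the next beams are drawn without replacement from all beam-token extensions in proportion to their joint probability; the paper's exact variant may differ. The names `beam_sampling_step` and `next_token_probs` are hypothetical.

```python
import math
import random


def beam_sampling_step(beams, next_token_probs, width, rng=random):
    """Extend every beam by one token and sample `width` survivors.

    beams: list of (tokens_tuple, log_prob) pairs.
    next_token_probs: callable tokens_tuple -> dict token -> probability
        (a stand-in for a model's next-token distribution).
    """
    candidates, weights = [], []
    for tokens, logp in beams:
        for tok, prob in next_token_probs(tokens).items():
            if prob > 0.0:
                candidates.append((tokens + (tok,), logp + math.log(prob)))
                weights.append(math.exp(logp) * prob)  # joint probability
    chosen = []
    for _ in range(min(width, len(candidates))):
        i = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates[i])
        weights[i] = 0.0  # sample without replacement
    return chosen
```

Keeping `width` candidates alive at every step is what drives the extra compute and memory cost that the abstract lists among its challenges.
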
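This paper explores a novel integration of speculative decoding with beam sampling.
Doing so raises four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model;
(2) how to dynamically optimize the number of beams to balance efficiency and accuracy;
(3) how to efficiently verify multiple drafts in parallel;
(4) how to address the extra memory cost inherent in beam sampling.
To address these challenges, we propose dynamic-width speculative beam decoding (DSBD).
Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution, based on beam sampling trajectories from the small model.
Next, we introduce an adaptive mechanism that dynamically tunes the number of beams based on the context, optimizing efficiency and effectiveness.
In addition, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process.
Finally, we illustrate a simple modification to the algorithm that mitigates the memory overhead of beam sampling…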

Abstract (original)

Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model’s distribution given drafts sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model’s distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling…

arXiv information

Authors	Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun
Published	2025-03-14 16:18:50+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
