SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference

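Summary

We introduce SuffixDecoding, a new model-free approach that accelerates large language model (LLM) inference through speculative decoding.
Unlike existing methods that depend on draft models or specialized decoding heads, SuffixDecoding uses suffix trees built from previously generated outputs to predict candidate token sequences efficiently.
The approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models.
SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, then uses them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies.
SuffixDecoding requires only CPU memory, which is plentiful and underutilized on typical LLM serving nodes.
We demonstrate that SuffixDecoding achieves speedups competitive with model-based approaches across diverse workloads, including open-domain chat, code generation, and text-to-SQL tasks.
For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-token (TPOT) latency.
For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding.
Our evaluation shows that SuffixDecoding maintains high acceptance rates even with a small reference corpus of 256 examples, and that performance continues to improve as more past outputs are incorporated.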

Abstract (original)

We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
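The idea of drafting tokens from past outputs by empirical frequency can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a depth-bounded trie of token n-grams in place of a true suffix tree, and it drafts a single greedy path rather than a scored speculation tree. The names `SuffixTrie`, `add`, `speculate`, `max_depth`, `max_pattern`, and `max_tokens` are hypothetical, and the step where the LLM verifies the drafted tokens is omitted.

```python
class SuffixTrie:
    """Counts every token n-gram (up to max_depth tokens) seen in past outputs."""

    def __init__(self, max_depth=8):
        self.max_depth = max_depth  # assumed bound to keep memory in check
        self.root = {"count": 0, "children": {}}

    def add(self, tokens):
        # Insert each suffix of `tokens`, truncated to max_depth tokens,
        # incrementing a frequency count on every node along the way.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node["children"].setdefault(
                    tok, {"count": 0, "children": {}}
                )
                node["count"] += 1

    def _find(self, pattern):
        # Walk the trie along `pattern`; return None if it never occurred.
        node = self.root
        for tok in pattern:
            node = node["children"].get(tok)
            if node is None:
                return None
        return node

    def speculate(self, context, max_pattern=4, max_tokens=4):
        # Match the longest recent context suffix that has known continuations.
        for length in range(min(max_pattern, len(context)), 0, -1):
            node = self._find(context[-length:])
            if node is not None and node["children"]:
                break
        else:
            return []  # nothing matched: fall back to ordinary decoding
        # Greedily follow the most frequent continuation at each step.
        draft = []
        while node["children"] and len(draft) < max_tokens:
            tok, node = max(
                node["children"].items(), key=lambda kv: kv[1]["count"]
            )
            draft.append(tok)
        return draft


trie = SuffixTrie()
trie.add(["SELECT", "name", "FROM", "users"])
trie.add(["SELECT", "name", "FROM", "orders"])
trie.add(["SELECT", "id", "FROM", "users"])
print(trie.speculate(["SELECT"], max_tokens=2))  # ['name', 'FROM']
```

In this toy corpus, "name" follows "SELECT" twice and "id" once, so the draft starts with "name". A real system would pass the drafted tokens to the LLM for verification and keep only the accepted prefix.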

arXiv information

Authors	Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao
Published	2024-11-07 18:49:33+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
