Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

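Summary

Speculative decoding is widely used to reduce the inference latency of large language models.
Its effectiveness depends heavily on the speculation lookahead (SL), that is, the number of tokens the draft model generates in each iteration.
This work shows that the common practice of using the same SL for every iteration (static SL) is suboptimal.
It introduces DISCO (DynamIc SpeCulation lookahead Optimization), a new method for selecting the SL dynamically.
In experiments on four datasets, DISCO achieves an average speedup of 10% over the best static SL baseline while generating exactly the same text.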

Summary (original)

Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
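
To make the role of the SL concrete, here is a minimal toy sketch of greedy speculative decoding in which the lookahead is chosen by a pluggable policy at every iteration. It is not the DISCO method: the abstract does not describe how DISCO picks the SL, so `oracle_sl` below is a hypothetical policy that simply exploits the known error pattern of the toy draft model. All names (`target_model`, `draft_model`, `speculative_decode`, `static_sl`, `oracle_sl`) are assumptions made for illustration, and the "models" are plain deterministic functions rather than neural networks.

```python
from typing import Callable, List, Tuple

Token = int
SLPolicy = Callable[[List[Token]], int]  # returns the lookahead for this iteration


def target_model(prefix: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[Token]) -> Token:
    """Toy stand-in for the small model: wrong whenever len(prefix) % 4 == 0."""
    guess = target_model(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference output: the target model decoding one token at a time."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_model(tokens))
    return tokens


def speculative_decode(
    prompt: List[Token], n_new: int, sl_policy: SLPolicy
) -> Tuple[List[Token], int, int]:
    """Return (tokens, target_passes, draft_calls) for greedy speculative decoding."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    draft_calls = 0
    while len(tokens) < goal:
        sl = max(1, sl_policy(tokens))  # speculation lookahead for this iteration
        draft: List[Token] = []
        for _ in range(sl):
            draft.append(draft_model(tokens + draft))
            draft_calls += 1
        # One target pass verifies all drafted positions (simulated sequentially).
        target_passes += 1
        accepted: List[Token] = []
        for d in draft:
            t = target_model(tokens + accepted)
            if t != d:
                accepted.append(t)  # replace the first wrong draft token
                break
            accepted.append(d)
        else:
            # Every draft token matched, so the target pass yields one extra token.
            accepted.append(target_model(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes, draft_calls


def static_sl(k: int) -> SLPolicy:
    """Static SL: the same lookahead at every iteration."""
    return lambda tokens: k


def oracle_sl(tokens: List[Token]) -> int:
    """Hypothetical dynamic SL: draft only up to the toy draft's next known error."""
    return max(1, (-len(tokens)) % 4)


prompt = [1, 2, 3]
reference = greedy_decode(prompt, 20)
static_out, static_passes, static_calls = speculative_decode(prompt, 20, static_sl(5))
dynamic_out, dynamic_passes, dynamic_calls = speculative_decode(prompt, 20, oracle_sl)
print(static_out == reference, dynamic_out == reference)
print("static :", static_passes, "target passes,", static_calls, "draft calls")
print("dynamic:", dynamic_passes, "target passes,", dynamic_calls, "draft calls")
```

Under greedy verification the generated text is identical to the target model's own output whatever SL is used, so the SL affects only cost: how many draft tokens are produced and discarded, and how many target passes are needed. A real dynamic policy has no access to an oracle like the one in this toy and would have to rely on signals available at run time.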

arXiv information

Authors	Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz
Published	2024-11-07 12:59:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
