Accelerating Speculative Decoding using Dynamic Speculation Length

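Summary

Speculative decoding is a promising method for reducing the inference latency of large language models.
The effectiveness of this method depends on the speculation length (SL), that is, the number of tokens the draft model generates at each iteration.
Most speculative decoding approaches use the same SL for every iteration.
This work shows that this practice is suboptimal.
It introduces DISCO, a dynamic speculation length optimization method that uses a classifier to adjust the SL dynamically at each iteration while provably preserving decoding quality.
Experiments on four benchmarks demonstrate an average speedup gain of 10.3% relative to the best baselines.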

Summary (original)

Speculative decoding is a promising method for reducing the inference latency of large language models. The effectiveness of the method depends on the speculation length (SL) – the number of tokens generated by the draft model at each iteration. The vast majority of speculative decoding approaches use the same SL for all iterations. In this work, we show that this practice is suboptimal. We introduce DISCO, a DynamIc SpeCulation length Optimization method that uses a classifier to dynamically adjust the SL at each iteration, while provably preserving the decoding quality. Experiments with four benchmarks demonstrate average speedup gains of 10.3% relative to our best baselines.
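
To make the role of the speculation length concrete, here is a minimal toy sketch of greedy speculative decoding with a static versus a dynamic SL. It is not the DISCO implementation: `target_model`, `draft_model`, and `should_stop` are hypothetical stand-ins, the "classifier" is just a confidence threshold, and verification assumes greedy decoding (a drafted token is accepted only if it equals the target model's own choice). The prompt is assumed to be non-empty.

```python
from typing import List, Tuple

Token = int


def target_model(prefix: List[Token]) -> Tuple[Token, float]:
    # Hypothetical target model: the next token is (last + 1) mod 10.
    return (prefix[-1] + 1) % 10, 1.0


def draft_model(prefix: List[Token]) -> Tuple[Token, float]:
    # Hypothetical draft model: agrees with the target except after a 4 or a 9,
    # where it guesses wrong and reports low confidence.
    last = prefix[-1]
    if last % 5 == 4:
        return (0 if last == 4 else 5), 0.2
    return (last + 1) % 10, 0.9


def should_stop(confidence: float, threshold: float = 0.5) -> bool:
    # Assumed stand-in for a learned classifier: stop drafting on low confidence.
    return confidence < threshold


def speculative_decode(
    prompt: List[Token], num_new_tokens: int, max_sl: int = 8, dynamic: bool = True
) -> Tuple[List[Token], List[int]]:
    tokens = list(prompt)
    sl_history: List[int] = []  # speculation length used at each iteration
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # Draft phase: propose up to max_sl tokens, or fewer if the
        # stop rule fires (dynamic SL only).
        draft: List[Token] = []
        while len(draft) < max_sl:
            tok, conf = draft_model(tokens + draft)
            if dynamic and should_stop(conf):
                break
            draft.append(tok)
        sl_history.append(len(draft))

        # Verify phase: accept the longest prefix of the draft that matches
        # the target's greedy choices (a real system batches this check).
        accepted: List[Token] = []
        for tok in draft:
            expected, _ = target_model(tokens + accepted)
            if tok != expected:
                break
            accepted.append(tok)

        # The target always contributes one token: a correction after a
        # rejection, or a bonus token when the whole draft was accepted.
        next_tok, _ = target_model(tokens + accepted)
        tokens.extend(accepted + [next_tok])
    return tokens[:goal], sl_history


def greedy_decode(prompt: List[Token], num_new_tokens: int) -> List[Token]:
    # Reference: decode with the target model alone.
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_model(tokens)[0])
    return tokens


dyn_out, dyn_sl = speculative_decode([0], 12, dynamic=True)
static_out, static_sl = speculative_decode([0], 12, dynamic=False)
reference = greedy_decode([0], 12)
```

In this toy setup both variants reproduce the target-only output, but the static variant keeps drafting past the point where the draft model goes wrong, so part of every draft is thrown away. The dynamic variant stops drafting at that point, which is the kind of wasted work a per-iteration SL decision aims to avoid.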

arXiv information

Authors	Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz
Published	2024-05-07 13:27:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
