Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Abstract

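To counter the memory-bandwidth-bound nature of autoregressive LLM inference, prior work has proposed the speculative decoding framework.
In speculative decoding, a small draft model proposes candidate continuations of the input sequence, which the base model then verifies in parallel.
One way to specify the draft model, used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states.
To date, all existing draft heads have been sequentially independent: each one speculates a token in the candidate continuation without regard to the preceding tokens in that continuation.
In this work, we propose Hydra heads, a sequentially dependent, drop-in replacement for standard draft heads that significantly improves speculation accuracy.
Decoding with Hydra heads improves throughput compared with Medusa decoding with standard draft heads.
We further explore the design space of Hydra head training objectives and architectures and propose a carefully tuned Hydra head recipe, called Hydra++, that improves decoding throughput by 1.31x and 2.71x compared with Medusa decoding and autoregressive decoding, respectively.
Overall, Hydra heads are a simple intervention on standard draft heads that significantly improves the end-to-end speed of draft-head-based speculative decoding.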
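
The difference between sequentially independent and sequentially dependent draft heads can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the integer "vocabulary", the `step_rule` stand-in for the base model, the function names, and the simplified greedy verification of a single candidate. The toy is constructed so that the dependent heads happen to be exact; it shows the data flow only and says nothing about the throughput gains reported above.

```python
from typing import List, Sequence

VOCAB = 50  # toy vocabulary size (assumption)


def step_rule(prev: int, last: int) -> int:
    """Hypothetical next-token rule shared by the toy base model and heads."""
    return (3 * last + prev + 1) % VOCAB


def base_next(context: Sequence[int]) -> int:
    """Toy stand-in for the base model's greedy next token."""
    return step_rule(context[-2], context[-1])


def draft_independent(context: Sequence[int], num_heads: int) -> List[int]:
    """Every head reads only the last verified tokens, never earlier guesses."""
    prev, last = context[-2], context[-1]
    return [step_rule(prev, last) for _ in range(num_heads)]


def draft_dependent(context: Sequence[int], num_heads: int) -> List[int]:
    """Head k also reads the tokens speculated by heads 0..k-1."""
    seq = list(context)
    for _ in range(num_heads):
        seq.append(step_rule(seq[-2], seq[-1]))
    return seq[len(context):]


def verify_greedy(context: Sequence[int], candidate: Sequence[int]) -> List[int]:
    """Keep the longest candidate prefix that matches the base model's greedy
    choices, then append one base-model token. A real system would score all
    candidate positions in a single forward pass; this loop is for clarity."""
    seq = list(context)
    for tok in candidate:
        if tok != base_next(seq):
            break
        seq.append(tok)
    seq.append(base_next(seq))
    return seq[len(context):]


context = [1, 2]
ind = verify_greedy(context, draft_independent(context, 3))
dep = verify_greedy(context, draft_dependent(context, 3))
print(len(ind), len(dep))  # prints: 2 4
```

In this toy, both calls emit tokens that plain greedy decoding with `base_next` would also produce; the only difference is how many tokens are obtained per verification step. The independent heads work from stale inputs after the first position, so only one speculated token is accepted, while the dependent heads condition on the earlier guesses and all three are accepted.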

Abstract (original)

To combat the memory bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. To perform speculative decoding, a small draft model proposes candidate continuations of the input sequence, that are then verified in parallel by the base model. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of light-weight heads, called draft heads, that operate on the base model’s hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate tokens in the candidate continuation independently of any preceding tokens in the candidate continuation. In this work, we propose Hydra heads, a sequentially dependent, drop-in replacement for standard draft heads that significantly improves speculation accuracy. Decoding with Hydra heads improves throughput compared to Medusa decoding with standard draft heads. We further explore the design space of Hydra head training objectives and architectures, and propose a carefully-tuned Hydra head recipe, which we call Hydra++, that improves decoding throughput by 1.31x and 2.71x compared to Medusa decoding and autoregressive decoding, respectively. Overall, Hydra heads are a simple intervention on standard draft heads that significantly improve the end-to-end speed of draft head based speculative decoding.

arXiv information

Authors	Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon
Published	2024-02-07 18:58:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
