SPEED: Speculative Pipelined Execution for Efficient Decoding

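Summary

Generative large language models (LLMs) built on the Transformer architecture have recently become the dominant foundation models for a wide range of natural language processing tasks.
Even so, their high inference latency severely limits their use in real-time scenarios.
The problem is especially pronounced because generative LLM inference is autoregressive: tokens are produced one at a time, since each token depends on all previously generated tokens.
Token-level parallelism is therefore hard to achieve, and inference becomes heavily memory-bound.
This work proposes SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token, using values predicted from early-layer hidden states.
For Transformer decoders that use parameter sharing, the memory operations for tokens executing in parallel can be amortized, which speeds up generative LLM inference.
The paper demonstrates the method's efficiency in terms of latency reduction relative to model accuracy, and shows how speculation makes it possible to train deeper decoders with parameter sharing at minimal runtime overhead.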

Abstract (original)

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
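
The speculation idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "decoder" below is a made-up integer function, and every name in it (`shared_layer`, `embed`, `readout`, `NUM_LAYERS`, `EARLY_LAYER`, and the two decode functions) is a hypothetical stand-in chosen for illustration. It only shows the control flow: guess the current token from an early layer, start the next token from that guess, and keep the speculative result only if the guess matches the full-depth output.

```python
from typing import List, Tuple

VOCAB = 16        # toy vocabulary size (assumption)
NUM_LAYERS = 8    # depth of the toy decoder (assumption)
EARLY_LAYER = 2   # layer whose hidden state feeds the early guess (assumption)


def shared_layer(h: int) -> int:
    """One toy decoder layer; the same function is reused at every depth,
    mimicking parameter sharing."""
    return (5 * h + 3) % 97


def embed(context: List[int]) -> int:
    """Toy stand-in for attending over the whole prefix."""
    h = 0
    for token in context:
        h = (h * 31 + token + 1) % 97
    return h


def readout(h: int) -> int:
    """Map a hidden state to a token id."""
    return h % VOCAB


def next_token(context: List[int], depth: int) -> int:
    """Run `depth` shared layers on the prefix and read out a token."""
    h = embed(context)
    for _ in range(depth):
        h = shared_layer(h)
    return readout(h)


def decode_sequential(prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive greedy decoding: one token per step."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out, NUM_LAYERS))
    return out[len(prompt):]


def decode_speculative(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Guess the current token early, run the next token on that guess,
    and accept the speculative token only when the guess was right."""
    out = list(prompt)
    accepted = 0
    while len(out) - len(prompt) < n:
        guess = next_token(out, EARLY_LAYER)
        # In a real system this would run in parallel with the line after it.
        speculative = next_token(out + [guess], NUM_LAYERS)
        actual = next_token(out, NUM_LAYERS)
        out.append(actual)
        if guess == actual and len(out) - len(prompt) < n:
            out.append(speculative)  # guess verified: speculative work is valid
            accepted += 1
    return out[len(prompt):], accepted


if __name__ == "__main__":
    tokens, hits = decode_speculative([1, 2, 3], 20)
    print(tokens == decode_sequential([1, 2, 3], 20), hits)
```

Because a speculative token is kept only after its input guess has been verified, the speculative decoder in this sketch always emits the same sequence as the sequential one. Any speedup in a real system would come from overlapping the two full-depth passes, which this single-threaded toy does not attempt to measure.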

arXiv information

Authors	Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao
Published	2023-10-18 16:07:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
