PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

Abstract

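Inference of large language models (LLMs) across computer clusters has recently become a focus of research, and many acceleration techniques take inspiration from CPU speculative execution.
These techniques relieve bottlenecks related to memory bandwidth, but they also increase the end-to-end latency of each inference run, so a high speculation acceptance rate is needed to improve performance.
Because acceptance rates vary across tasks, speculative inference techniques can end up reducing performance.
In addition, pipeline-parallel designs need many user requests to maintain maximum utilization.
As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique that reduces inter-token latency and improves system utilization in single-request scenarios, while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects.
PipeInfer exhibits up to a 2.15$\times$ improvement in generation speed over standard speculative inference.
PipeInfer achieves this through Continuous Asynchronous Speculation and Early Inference Cancellation. The former improves latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.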
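
The dependence on acceptance rate can be made concrete with a small cost model. The sketch below is not taken from the PipeInfer paper. It uses the common simplified analysis of speculative decoding and assumes that each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per run, and that one draft-model step costs `draft_cost` times one target-model pass. All function and parameter names are hypothetical.

```python
def expected_tokens_per_run(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`; the target model always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the one from the target model.
        return float(gamma + 1)
    # Geometric series 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Speedup over plain one-token-per-pass decoding in this toy model.

    Assumption: a run costs `gamma` draft steps plus one target pass, with
    the target pass normalized to cost 1.
    """
    return expected_tokens_per_run(alpha, gamma) / (gamma * draft_cost + 1.0)
```

In this model, an acceptance rate near zero gives an estimated speedup below 1, so speculation slows generation down, which is the failure mode the abstract describes.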

Abstract (original)

Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but also increase end-to-end latency per inference run, requiring high speculation acceptance rates to improve performance. Combined with a variable rate of acceptance across tasks, speculative inference techniques can result in reduced performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15$\times$ improvement in generation speed over standard speculative inference. PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation, the former improving latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.
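
The idea of skipping invalidated runs can be illustrated with a toy model. The sketch below is not the PipeInfer implementation; it only assumes that each in-flight speculative run records the token prefix it was built on, and that a run becomes invalid once the accepted tokens disagree with that prefix. The `Run` class and both functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One in-flight speculative run in a toy pipeline (hypothetical)."""
    run_id: int
    prefix: tuple          # tokens this run assumes were already accepted
    cancelled: bool = False
    stages_done: int = 0


def invalidate(runs, accepted):
    """Cancel every run whose assumed prefix disagrees with accepted tokens."""
    accepted = tuple(accepted)
    for run in runs:
        n = min(len(run.prefix), len(accepted))
        if run.prefix[:n] != accepted[:n]:
            run.cancelled = True


def step_pipeline(runs, num_stages):
    """Advance each live run by one stage and return the work performed.

    Cancelled runs are skipped even if they are partway through the
    pipeline, so no further stages are computed for them.
    """
    work = 0
    for run in runs:
        if run.cancelled or run.stages_done >= num_stages:
            continue
        run.stages_done += 1
        work += 1
    return work
```

For example, with three runs built on prefixes `(1, 2)`, `(1, 2, 3)`, and `(1, 2, 4)`, accepting the tokens `[1, 2, 3]` cancels only the third run, and later pipeline steps do work for the remaining two.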

arXiv information

Authors	Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari
Published	2024-07-16 14:52:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
