POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

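Summary

Each request in LLM inference passes through two phases: a compute-bound prefill and a memory-bandwidth-bound decode.
To improve GPU utilization, recent systems use hybrid batching, which combines the prefill and decode phases of different requests in the same batch.
Hybrid batching works well for linear operations because it amortizes the cost of loading model weights from HBM.
Attention computation in hybrid batches nevertheless remains inefficient, because existing attention kernels are optimized for either prefill or decode, not both.
This paper presents POD-Attention, described as the first GPU kernel that efficiently computes attention for hybrid batches.
POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources so that prefill and decode operations run concurrently on the same multiprocessor.
The authors integrate POD-Attention into Sarathi-Serve, a state-of-the-art LLM inference scheduler.
POD-Attention speeds up attention computation by up to 75% (28% on average) and raises LLM serving throughput by up to 22% in offline inference.
In online inference, POD-Attention achieves lower time-to-first-token (TTFT), time-between-tokens (TBT), and request execution latency than Sarathi-Serve.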
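
To make the compute-bound versus memory-bandwidth-bound distinction concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the paper: the function names, the head dimension of 128, the fp16 element size, and the ridge point of 150 FLOPs per byte are all illustrative assumptions, and the estimate ignores causal masking and any kernel-level tiling.

```python
def attention_intensity(q_len, kv_len, head_dim=128, bytes_per_elem=2):
    """Rough FLOPs-per-byte estimate for one attention head.

    Assumptions (illustrative, not from the paper): no causal masking,
    Q, K and V are each read once, O is written once, fp16 elements.
    """
    # QK^T and PV each cost about 2 * q_len * kv_len * head_dim FLOPs.
    flops = 4 * q_len * kv_len * head_dim
    # Q and O have q_len rows; K and V have kv_len rows.
    bytes_moved = bytes_per_elem * head_dim * (2 * q_len + 2 * kv_len)
    return flops / bytes_moved


def classify(intensity, ridge_flops_per_byte=150.0):
    """Compare against a hypothetical roofline ridge point (assumed value)."""
    if intensity >= ridge_flops_per_byte:
        return "compute-bound"
    return "memory-bound"


# Prefill: all 4096 prompt tokens attend over a 4096-token context.
prefill = attention_intensity(q_len=4096, kv_len=4096)
# Decode: a single new token attends over the same 4096-token context.
decode = attention_intensity(q_len=1, kv_len=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte ({classify(prefill)})")
print(f"decode:  {decode:.3f} FLOPs/byte ({classify(decode)})")
```

Under these assumptions the intensity simplifies to q_len * kv_len / (q_len + kv_len), so decode stays near 1 FLOP per byte regardless of context length, while prefill grows with the prompt length. That gap is why a kernel tuned for one phase tends to leave either compute or memory bandwidth idle on the other.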

Summary (original)

Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. Hybrid batching works well for linear operations as it amortizes the cost of loading model weights from HBM. However, attention computation in hybrid batches remains inefficient because existing attention kernels are optimized for either prefill or decode. In this paper, we present POD-Attention — the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU’s resources such that prefill and decode operations happen concurrently on the same multiprocessor. We integrate POD-Attention in a state-of-the-art LLM inference scheduler Sarathi-Serve. POD-Attention speeds up attention computation by up to 75% (mean 28%) and increases LLM serving throughput by up to 22% in offline inference. In online inference, POD-Attention enables lower time-to-first-token (TTFT), time-between-tokens (TBT), and request execution latency versus Sarathi-Serve.

arXiv information

Authors	Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar
Published	2024-10-23 17:06:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
