Hardware-Efficient Attention for Fast Decoding

Summary

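For large batches and long contexts, LLM decoding is limited by the cost of loading the key-value (KV) cache from high-bandwidth memory. This raises per-token latency, and the sequential nature of decoding restricts parallelism.
The authors analyze how arithmetic intensity, parallelization, and model quality interact, and ask whether current architectures fully exploit modern hardware.
The work redesigns attention so that more computation is performed per byte loaded from memory, maximizing hardware efficiency without sacrificing parallel scalability.
First, it proposes Grouped-Tied Attention (GTA), a simple variant that combines and reuses the key and value states, cutting memory transfers without hurting model quality.
Next, it introduces Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding that still maintains high model quality.
In experiments, GTA matches the quality of Grouped-Query Attention (GQA) while using roughly half the KV cache, and GLA matches Multi-head Latent Attention (MLA) while being easier to shard.
The optimized GLA kernel is up to 2$\times$ faster than FlashMLA, for example in a speculative decoding setting where the query length exceeds one.
In addition, because each device fetches a smaller KV cache, GLA lowers end-to-end latency and raises throughput in online serving benchmarks by up to 2$\times$.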

Summary (original)

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2$\times$ faster than FlashMLA, for example, in a speculative decoding setting when the query length exceeds one. Furthermore, by fetching a smaller KV cache per device, GLA reduces end-to-end latency and increases throughput in online serving benchmarks by up to 2$\times$.
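
The idea of doing "more computation per byte loaded from memory" can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not taken from the paper: the function name `decode_attention_intensity` and its parameters are hypothetical, the cost model counts only the two attention matrix products against the bytes of cached state read, and the tied case simply assumes that one shared state is loaded once and reused as both key and value, ignoring any extra per-token state a real design might keep.

```python
def decode_attention_intensity(group_size, q_len=1, bytes_per_elem=2, tied_kv=False):
    """Rough FLOPs per byte of KV cache loaded during decoding (illustrative model).

    group_size     : query heads sharing one cached KV head (1 = plain multi-head attention)
    q_len          : query tokens processed per step (>1 in speculative decoding)
    bytes_per_elem : bytes per cached element (2 assumes a 16-bit cache)
    tied_kv        : assumption: a single cached state serves as both key and value
    """
    # QK^T and PV each cost 2 FLOPs (multiply + add) per query head,
    # per query token, per cached element.
    flops = 2 * 2 * group_size * q_len
    # Separate K and V: two cached elements are read. Tied state: one is read and reused.
    states_loaded = 1 if tied_kv else 2
    bytes_loaded = states_loaded * bytes_per_elem
    return flops / bytes_loaded


if __name__ == "__main__":
    configs = [
        ("multi-head, q_len=1", dict(group_size=1)),
        ("grouped-query (8 heads per group)", dict(group_size=8)),
        ("grouped + tied KV (assumed model)", dict(group_size=8, tied_kv=True)),
        ("grouped + tied KV, q_len=2", dict(group_size=8, q_len=2, tied_kv=True)),
    ]
    for name, kwargs in configs:
        print(f"{name}: {decode_attention_intensity(**kwargs):.1f} FLOPs/byte")
```

Under these assumptions, sharing a cached head across more query heads, reusing one state for both key and value, and processing more than one query token per step each raise the FLOPs performed per byte read, which is the direction the abstract describes. The absolute numbers are only as good as the simplified cost model.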

arXiv information

Authors	Ted Zadouri, Hubert Strauss, Tri Dao
Published	2025-05-27 17:54:07+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
