TurboAttention: Efficient Attention Approximation For High Throughputs LLMs

Abstract

Large language model (LLM) inference requires large amounts of computation and memory, particularly in the attention mechanism at its core.
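Techniques such as quantization and acceleration algorithms like FlashAttention improve overall inference efficiency, but they address different aspects of the problem: quantization focuses on weight-activation operations, whereas FlashAttention speeds up execution but requires high-precision formats.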
Recent Key-Value (KV) cache quantization reduces memory bandwidth, but the attention operation still requires floating-point dequantization.
We present TurboAttention, a comprehensive approach that enables quantized execution of attention and addresses memory and computational efficiency at the same time.
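Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention.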
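Experimental results show that TurboAttention achieves a 1.2-1.8x speedup in attention, reduces the KV cache size by more than 4.4x, and enables up to 2.37x higher maximum throughput than the FP16 baseline, while outperforming state-of-the-art quantization and compression techniques across various datasets and models.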
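
As a rough illustration of what headwise quantization with integer execution can look like, the following Python sketch quantizes each head with its own scale and computes a query-key dot product in integers. It is a generic symmetric 8-bit scheme written for this summary, not FlashQ itself: the function names (`quantize_per_head`, `dequantize_per_head`, `int_dot`), the 8-bit width, and the max-abs scale are all assumptions, since the abstract does not specify those details.

```python
def quantize_per_head(heads, bits=8):
    """Symmetric quantization with one scale per attention head.

    `heads` is a list of heads; each head is a flat list of floats.
    Returns (integer codes, per-head scales). Hypothetical helper.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    codes, scales = [], []
    for head in heads:
        max_abs = max(abs(x) for x in head)
        # Assumed choice: map the largest magnitude in the head to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes.append([max(-qmax, min(qmax, round(x / scale))) for x in head])
        scales.append(scale)
    return codes, scales


def dequantize_per_head(codes, scales):
    """Recover approximate floats from integer codes and per-head scales."""
    return [[c * s for c in head] for head, s in zip(codes, scales)]


def int_dot(q_codes, q_scale, k_codes, k_scale):
    """Dot product accumulated in integers, rescaled once at the end."""
    acc = sum(a * b for a, b in zip(q_codes, k_codes))
    return acc * q_scale * k_scale


# Two heads with very different value ranges each get their own scale.
heads = [[0.5, -1.0, 0.25], [10.0, -20.0, 5.0]]
codes, scales = quantize_per_head(heads)
print(codes, scales)
```

Keeping one scale per head means a head with small values is not crushed by another head's large range. Storing 8-bit codes instead of 16-bit floats can at best halve the cache, so the 4.4x reduction reported above implies a more aggressive scheme than this sketch uses.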

Abstract (original)

Large language model (LLM) inference demands significant amount of computation and memory, especially in the key attention mechanism. While techniques, such as quantization and acceleration algorithms, like FlashAttention, have improved efficiency of the overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent Key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.
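
One generic way to avoid a full-precision exponential in softmax is sketched below in Python: scores far below the row maximum are treated as zero (the sparsity part), and exp is approximated with a small lookup table plus a low-degree polynomial. This is an illustrative assumption, not the SAS algorithm as published: the threshold of 6, the table layout, the cubic polynomial, and the function names are hypothetical, and plain Python floats stand in for the low-precision arithmetic a real kernel would use.

```python
import math

# Assumed cut-off: scores more than THRESHOLD below the row max get weight 0.
THRESHOLD = 6
# Lookup table for exp(-n), n = 0..THRESHOLD (integer part of the exponent).
EXP_LUT = [math.exp(-n) for n in range(THRESHOLD + 1)]


def approx_exp_neg(d):
    """Approximate exp(-d) for d >= 0; returns 0.0 when d exceeds THRESHOLD."""
    if d > THRESHOLD:
        return 0.0
    n = int(d + 0.5)  # nearest integer -> table lookup
    f = d - n         # remainder in [-0.5, 0.5]
    # Degree-3 Taylor polynomial for exp(-f) on the small remainder.
    poly = 1.0 - f + f * f / 2.0 - f ** 3 / 6.0
    return EXP_LUT[n] * poly


def sparse_softmax(scores):
    """Softmax over a list of floats using the approximation above."""
    m = max(scores)
    weights = [approx_exp_neg(m - s) for s in scores]
    total = sum(weights)  # at least 1.0, because the max score has weight 1
    return [w / total for w in weights]


print(sparse_softmax([2.0, 1.3, 0.0, -10.0]))
```

Subtracting the row maximum keeps every exponent non-positive, so the table only needs a handful of entries and the dropped tail contributes less than exp(-6) per element.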

arXiv information

Authors	Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan
Published	2024-12-17 05:40:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
