Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding




The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; it needs only a brief calibration phase, and the calibrated thresholds are resilient to distribution shifts, so they do not have to be recalibrated for different datasets. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
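The contrast between threshold pruning and top-k selection can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the calibration rule (averaging the k-th largest score over sample rows), and the renormalized softmax used as a stand-in for compensation are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_threshold(calibration_rows, k):
    """Hypothetical calibration rule (an assumption, not taken from the abstract):
    average the k-th largest score over a set of sample attention rows."""
    kth_values = [sorted(row, reverse=True)[k - 1] for row in calibration_rows]
    return sum(kth_values) / len(kth_values)


def top_theta_mask(scores, theta):
    """Elementwise test: each score is compared with theta on its own,
    so the mask can be computed tile by tile."""
    return [s >= theta for s in scores]


def top_k_mask(scores, k):
    """Top-k needs the whole row before any single element can be classified."""
    kth = sorted(scores, reverse=True)[k - 1]
    return [s >= kth for s in scores]


def pruned_attention(scores, values, theta):
    """Attention output for one query using only the surviving elements.

    The softmax is renormalized over the kept scores, a simple stand-in
    (assumption) for the compensation techniques the abstract mentions.
    Returns the output vector and the number of V rows that were read.
    """
    keep = top_theta_mask(scores, theta)
    if not any(keep):
        # Fall back to the single largest score so the output stays defined.
        best = max(range(len(scores)), key=lambda i: scores[i])
        keep[best] = True
    kept_scores = [s for s, m in zip(scores, keep) if m]
    kept_values = [v for v, m in zip(values, keep) if m]
    weights = softmax(kept_scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, kept_values)) for d in range(dim)]
    return out, len(kept_values)


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5, 3.0]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
    theta = calibrate_threshold([[3.0, 1.0, 0.0], [5.0, 3.0, 2.0]], k=2)
    print(pruned_attention(scores, values, theta))
```

Because `top_theta_mask` looks at one score at a time, applying it to two halves of a row and concatenating the results gives the same mask as applying it to the full row. `top_k_mask` has no such property, which is the full-vector dependency the abstract refers to.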


Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
Published: 2025-02-12 12:50:15+00:00

Categories: 68T01, cs.AI, cs.CL, I.2