More Expressive Attention with Negative Weights

Summary

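We propose a new attention mechanism, Cog Attention, that allows attention weights to be negative in order to improve expressiveness. The improvement stems from two key factors.
(1) Cog Attention increases parameter flexibility.
Traditional softmax attention heads rely on a static output-value (OV) matrix to delete or copy the inputs they attend to. Cog Attention instead naturally learns to use the sign of the dynamic query-key (QK) inner product to represent these operations.
This lets a single head perform multiple operations simultaneously.
Meanwhile, the OV matrix in Cog Attention can focus more on refinement or modification.
(2) Cog Attention makes the model more robust to representational collapse by preventing the "over-squashing" of earlier tokens into later positions.
We develop Transformer-like models that use Cog Attention as their attention module, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation.
In experiments, models using Cog Attention outperform models that use traditional softmax attention modules.
Our approach suggests a promising research direction: rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.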
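The abstract does not state the formula for the signed weights. As a minimal sketch of the general idea, the code below compares ordinary softmax weights with a hypothetical signed normalization that applies softmax to the absolute scores and then restores each score's sign. The function names (`signed_weights`, `attend`) and this particular normalization are assumptions made for illustration, not necessarily the authors' formulation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def signed_weights(scores):
    # Hypothetical signed normalization (an assumption, not taken from the abstract):
    # softmax over |score|, then restore the sign of each score.
    mags = softmax([abs(s) for s in scores])
    return [math.copysign(w, s) for w, s in zip(mags, scores)]

def attend(query, keys, values, weight_fn):
    # Single-query attention: scaled QK inner products, then a weighted sum of values.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = weight_fn(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, out

# Toy inputs: the second key points opposite to the query, so its score is negative.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft_w, soft_out = attend(query, keys, values, softmax)
sign_w, sign_out = attend(query, keys, values, signed_weights)

print("softmax weights:", soft_w)
print("signed weights: ", sign_w)
```

With softmax, the token with a negative score can only be down-weighted toward zero. With the signed variant, its value vector is subtracted from the output, so the sign of the QK inner product can express "delete" while a positive sign expresses "copy", without relying on the OV matrix to do so.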

Abstract (original)

We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention’s OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model’s robustness against representational collapse by preventing the “over-squashing” of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.

arXiv information

Authors	Ang Lv, Ruobing Xie, Shuaipeng Li, Jiayi Liao, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
Published	2025-01-30 18:17:13+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
