Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Summary

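Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision.
While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD).
This paper investigates the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms.
Specifically, it shows that these algorithms converge toward a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model.
Notably, the theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the problem.
The paper also examines the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this joint optimization converges to the respective hard-margin SVM solutions.
Finally, numerical experiments on real data show that MD algorithms improve generalization over standard GD and excel at optimal token selection.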

Summary (original)

Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
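To make the update rule concrete, here is a minimal sketch of one mirror descent step with an $\ell_p$-type potential. It assumes the potential $\psi(w) = \frac{1}{p}\sum_i |w_i|^p$ with $p > 1$ (the $1/p$ scaling is an assumption; the abstract only says "the $p$-th power of the $\ell_p$-norm") and treats the parameters as a flat list rather than a key-query matrix. The function names are hypothetical and are not taken from the paper.

```python
import math

def mirror_map(w, p):
    """Entrywise gradient of the assumed potential psi(w) = (1/p) * sum(|w_i|^p)."""
    return [math.copysign(abs(x) ** (p - 1), x) for x in w]

def inverse_mirror_map(z, p):
    """Entrywise inverse of mirror_map, valid for p > 1."""
    return [math.copysign(abs(v) ** (1.0 / (p - 1)), v) for v in z]

def lp_mirror_descent_step(w, grad, lr, p):
    """One step: map to the dual space, take a gradient step there, map back."""
    z = mirror_map(w, p)
    z_next = [zi - lr * gi for zi, gi in zip(z, grad)]
    return inverse_mirror_map(z_next, p)

# Hypothetical parameters and loss gradient, for illustration only.
w = [1.0, -2.0, 0.5]
grad = [0.5, 0.5, -1.0]
w_gd = lp_mirror_descent_step(w, grad, lr=0.1, p=2)  # coincides with plain GD
w_p3 = lp_mirror_descent_step(w, grad, lr=0.1, p=3)  # non-Euclidean geometry
```

With $p = 2$ the mirror map is the identity, so the step reduces to ordinary gradient descent; other values of $p$ change the geometry of the update, which is what gives rise to a different implicit bias.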

arXiv information

Authors	Addison Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan
Published	2025-03-21 13:15:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
