Anisotropy Is Inherent to Self-Attention in Transformers

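Summary

The representation degeneration problem is widely observed across Transformer-based self-supervised learning methods.
In NLP it appears as anisotropy: hidden representations end up unexpectedly close to one another in angular distance (cosine similarity).
Recent work tends to attribute anisotropy to optimizing the cross-entropy loss over long-tailed token distributions.
This paper shows empirically that anisotropy also arises in language models whose specific training objectives should not be directly subject to that effect.
It also shows that the problem extends to Transformers trained on other modalities.
These observations suggest that anisotropy is in fact inherent to Transformer-based models.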
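
To make the notion of "unexpectedly close in cosine similarity" concrete, here is a minimal sketch that assumes anisotropy is summarized as the mean cosine similarity over all distinct pairs of representation vectors. The function names (`cosine`, `mean_pairwise_cosine`) and the synthetic Gaussian data are illustrative assumptions, not taken from the paper; real measurements would use hidden states extracted from a trained model.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(reps):
    """Average cosine similarity over all distinct pairs of vectors.

    Values near 0 indicate roughly isotropic representations;
    values near 1 indicate strong anisotropy.
    """
    n = len(reps)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(reps[i], reps[j])
    return total / (n * (n - 1) / 2)


# Synthetic stand-ins for hidden representations (assumption: not real model outputs).
random.seed(0)
dim, count = 32, 100
isotropic = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

# Adding the same offset to every vector pushes them all into a narrow cone.
shifted = [[x + 3.0 for x in vec] for vec in isotropic]

iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(shifted)
print(f"isotropic: {iso_score:.3f}, shifted: {aniso_score:.3f}")
```

The shifted set scores far higher than the zero-mean set, which is the signature the summary describes: vectors that share a common direction look close to each other in angle even when they differ substantially in every other respect.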

Abstract (original)

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations suggest that anisotropy is actually inherent to Transformers-based models.

arXiv information

Authors	Nathan Godey, Éric de la Clergerie, Benoît Sagot
Published	2024-01-22 17:26:55+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
