On the Expressivity Role of LayerNorm in Transformers’ Attention

Summary

Title: On the Expressivity Role of LayerNorm in Transformers’ Attention

Summary:

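– Layer normalization (LayerNorm) is an inherent component of all Transformer-based models.
– The paper shows that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it.
– This contrasts with the common belief that LayerNorm's only role is to normalize the activations during the forward pass and their gradients during the backward pass.
– Viewed geometrically, LayerNorm consists of two components: (a) projection of the input vectors onto the $(d-1)$-dimensional subspace orthogonal to the $\left[1,1,\ldots,1\right]$ vector, and (b) scaling of all vectors to the same norm, $\sqrt{d}$.
– Each component matters for the attention layer that follows. Projection lets the attention mechanism create a query that attends to all keys equally, so the attention layer does not have to learn this operation itself. Scaling lets every key potentially receive the highest attention and prevents keys from becoming "un-select-able".
– Experiments show that Transformers benefit from these properties both in general language modeling and in computing simple functions such as "majority".
– Source code: https://github.com/tech-srl/layer_norm_expressivity_role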
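
The two-component view of LayerNorm described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes LayerNorm without the learned gain and bias and with the epsilon term set to zero, and the function names `layer_norm` and `project_then_scale` are hypothetical names chosen for illustration.

```python
import math

def layer_norm(x):
    """Textbook LayerNorm, (x - mean) / std, with no learned gain/bias and eps = 0 (assumption)."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var) for v in x]

def project_then_scale(x):
    """Geometric view: (a) project onto the hyperplane orthogonal to [1,...,1],
    then (b) rescale the result to norm sqrt(d)."""
    d = len(x)
    mean = sum(x) / d
    # (a) Subtracting the mean removes the component of x along the all-ones vector.
    projected = [v - mean for v in x]
    # (b) Rescale so that the output has norm sqrt(d).
    norm = math.sqrt(sum(v * v for v in projected))
    return [math.sqrt(d) * v / norm for v in projected]

x = [2.0, -1.0, 0.5, 4.0]          # arbitrary illustrative input, d = 4
a = layer_norm(x)
b = project_then_scale(x)

same = all(abs(p - q) < 1e-9 for p, q in zip(a, b))   # both views agree
ones_component = sum(a)                               # dot product with [1,1,1,1]
out_norm = math.sqrt(sum(v * v for v in a))           # should equal sqrt(4)
print(same, ones_component, out_norm)
```

The two functions agree because the standard deviation equals the norm of the mean-centered vector divided by $\sqrt{d}$, so dividing by it is the same as rescaling to norm $\sqrt{d}$.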

Abstract (original)

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm’s only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors to a $d-1$ space that is orthogonal to the $\left[1,1,\ldots,1\right]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, offloading the need to learn this operation by the attention; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from being ‘un-select-able’. We show empirically that Transformers do indeed benefit from these properties of LayerNorm in general language modeling and even in computing simple functions such as ‘majority’. Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role .
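
The two attention-related claims in the abstract can be illustrated with a toy example. This is a hedged sketch under simplifying assumptions that are not from the paper: attention scores are taken to be raw dot products between a query and the keys (the learned query/key matrices are omitted), the key vectors are made up, and all helper names are hypothetical. Part (b) rescales the keys directly in two dimensions to isolate the effect of equal norms.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def center(x):
    """Project x onto the hyperplane orthogonal to the all-ones vector."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def rescale(x, target_norm):
    norm = math.sqrt(dot(x, x))
    return [target_norm * v / norm for v in x]

def argmax_key(query, keys):
    scores = [dot(query, k) for k in keys]
    return scores.index(max(scores))

# (a) Projection: once every key is orthogonal to [1,1,1], the all-ones query
# scores every key (numerically) zero, so softmax attends to all keys equally.
keys = [center(k) for k in ([3.0, 1.0, -2.0], [0.5, 0.5, 4.0], [-1.0, 2.0, 2.0])]
weights = softmax([dot([1.0, 1.0, 1.0], k) for k in keys])

# (b) Scaling: the last key lies inside the convex hull of the others, so no
# query can give it the strictly highest dot-product score.
raw_keys = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0], [0.5, 0.5]]
inner = 4
rng = random.Random(0)
queries = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
wins_before = sum(argmax_key(q, raw_keys) == inner for q in queries)

# After all keys are rescaled to the same norm, using a key as the query
# selects that key (by the Cauchy-Schwarz inequality).
scaled_keys = [rescale(k, math.sqrt(2)) for k in raw_keys]
wins_after = argmax_key(scaled_keys[inner], scaled_keys) == inner

print(weights, wins_before, wins_after)
```

In this toy setting the inner key never receives the top score before rescaling, while after rescaling a query pointing in its direction selects it.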

arXiv information

Authors	Shaked Brody, Uri Alon, Eran Yahav
Published	2023-05-04 06:32:05+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
