Attention-Only Transformers and Implementing MLPs with Attention Heads

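Summary

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs.
The paper proves that an MLP neuron can be implemented by a masked attention head with internal dimension 1, provided the MLP's activation function comes from a restricted class that includes SiLU and close approximations of ReLU and GeLU.
As a result, an MLP-and-attention transformer can be converted into an attention-only transformer, at the cost of greatly increasing the number of attention heads.
The paper also proves that attention heads can perform the components of an MLP (linear transformations and activation functions) separately.
Finally, it proves that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.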

Abstract (original)

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP’s activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
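The SiLU case rests on a simple identity that can be checked numerically: a softmax over the two scores 0 and x puts weight sigmoid(x) on the second entry, so an attention-style weighted sum over the values 0 and x returns x * sigmoid(x) = SiLU(x). The sketch below illustrates only that identity. It is not the construction from the paper; the two-position setup (an anchor position with score 0 and value 0, plus the current position with score and value x) and all function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def silu(x):
    """Reference SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_via_attention(x):
    """Toy one-dimensional attention readout (hypothetical setup).

    Position 0 is an assumed anchor with score 0 and value 0.
    Position 1 is the current token with score x and value x.
    """
    scores = [0.0, x]
    values = [0.0, x]
    weights = softmax(scores)  # weights[1] equals sigmoid(x)
    return sum(w * v for w, v in zip(weights, values))


# Compare the two computations on a few sample inputs.
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert math.isclose(silu(x), silu_via_attention(x), rel_tol=1e-9, abs_tol=1e-12)
```

Because the score and the value are both scalars here, the toy head needs only one internal dimension, which is the intuition behind the "internal dimension 1" claim; the handling of real residual streams, masks, and weight matrices is left to the paper.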

arXiv information

Authors	Robert Huben, Valerie Morris
Published	2023-09-15 17:47:45+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
