The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

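Abstract

In deep learning theory, the covariance matrix of the representations serves as a proxy for examining a network's trainability.
Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite depth and width.
We show that, at initialization, the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio.
To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at the identity and scaling the Softmax logits by a width-dependent temperature parameter.
We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and the diffusion can be elegantly controlled with the aid of residual connections.
The existence of a stable SDE implies that the covariance structure is well behaved even when the depth and width are very large, which prevents the notorious problem of rank degeneracy in deep attention models.
Finally, we show through simulations that the SDE provides a surprisingly good description of the corresponding finite-size model.
We coin the name shaped Transformer for these architectural modifications.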

Abstract (original)

In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network’s trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer’s attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
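
As a rough illustration of the modification described in the abstract, the following Python sketch builds an attention matrix whose Softmax output is recentred at the identity and whose logits are divided by a temperature. The explicit form `I + softmax(logits / tau) - (1/m) * ones`, the scaling `tau = tau0 * sqrt(n * n_k)`, and all names (`shaped_attention`, `tau0`, `n`, `n_k`) are assumptions made for this example; the abstract itself only states that the output is centred at identity and that the temperature depends on the width.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def shaped_attention(logits, tau):
    """Return I + softmax(logits / tau) - (1/m) * ones, computed row by row.

    logits: m x m list of lists holding query-key scores.
    tau:    temperature, assumed here to grow with the width.
    """
    m = len(logits)
    out = []
    for i, row in enumerate(logits):
        probs = softmax([x / tau for x in row])
        out.append([
            (1.0 if i == j else 0.0) + p - 1.0 / m
            for j, p in enumerate(probs)
        ])
    return out


# Hypothetical sizes: width n, query/key dimension n_k, base temperature tau0.
n, n_k, tau0 = 256, 64, 1.0
tau = tau0 * math.sqrt(n * n_k)  # assumed width-dependent scaling

# Arbitrary example scores for m = 3 tokens.
logits = [[3.0, -1.0, 0.5], [0.2, 2.0, -0.7], [-1.5, 0.3, 1.0]]
attn = shaped_attention(logits, tau)
```

With a large temperature the Softmax is close to uniform, so the subtracted uniform term nearly cancels it and the matrix stays near the identity; every row still sums to one.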

arXiv information

Authors	Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy
Published	2023-06-30 16:10:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
