Transformers without Normalization

Summary

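Normalization layers are ubiquitous in modern neural networks and have long been regarded as indispensable.
This work shows that Transformers without normalization can achieve equal or better performance using a very simple technique.
We introduce Dynamic Tanh (DyT), the element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for the normalization layers in Transformers.
DyT is motivated by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings.
With DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
We validate the effectiveness of Transformers with DyT across a range of settings, from recognition to generation, from supervised to self-supervised learning, and from computer vision to language models.
These findings challenge the conventional view that normalization layers are indispensable in modern neural networks, and they offer new insights into the role of these layers in deep networks.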

Summary (original)

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
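
The element-wise definition of DyT given in the abstract can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the authors' implementation: the function names `dyt` and `layer_norm`, the sample activations, and the default `alpha=0.5` are assumptions made for the example, and `alpha` is treated here as a fixed scalar argument. A parameter-free layer normalization is included only for comparison.

```python
import math


def dyt(xs, alpha=0.5):
    """Element-wise DyT(x) = tanh(alpha * x) over a list of floats.

    alpha is a plain scalar argument here; the default of 0.5 is an
    illustrative assumption, not a value taken from the abstract.
    """
    return [math.tanh(alpha * x) for x in xs]


def layer_norm(xs, eps=1e-5):
    """Layer normalization without affine parameters, for comparison only."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


# Hypothetical activations for one token, chosen to include large values.
activations = [-6.0, -1.0, 0.0, 1.0, 6.0]

squashed = dyt(activations)           # each value depends only on its own input
normalized = layer_norm(activations)  # each value depends on the mean and variance

for x, d, n in zip(activations, squashed, normalized):
    print(f"x={x:+.1f}  dyt={d:+.4f}  layer_norm={n:+.4f}")
```

In this sketch `dyt` computes no statistics over the activations: it maps each value independently onto an $S$-shaped curve bounded by -1 and 1, whereas `layer_norm` needs the mean and variance of the whole vector.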

arXiv information

Authors	Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Published	2025-03-13 17:59:06+00:00
arXiv site	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
