What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

Summary

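The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs).
At its core, the attention block differs in form and function from most other architectural components in deep learning, to the extent that, compared with MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and similar techniques.
The root causes behind these outward manifestations remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures, grounded in a theoretical comparison of the (loss) Hessian.
Concretely, for a single self-attention layer, (a) we first derive the Transformer's Hessian in full and express it in matrix derivatives;
(b) we then characterize it in terms of its dependence on the data, the weights, and the attention moments;
(c) and, in doing so, we further highlight the important structural differences from the Hessian of classical networks.
Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependence on the data and weight matrices, which varies heterogeneously across parameters.
Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.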

Summary (original)

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning–to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures–grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer’s Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer’s unique optimization landscape and the challenges it poses.
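
As a rough numerical illustration of what "Hessian blocks of a single self-attention layer" means, the sketch below builds a tiny layer and estimates two diagonal blocks of the loss Hessian by finite differences. Everything specific here is an assumption made for illustration and is not taken from the paper: the standard scaled dot-product form softmax(X W_Q W_K^T X^T / sqrt(d_k)) X W_V, a squared-error loss, the toy sizes, the random data, and the helper names (`softmax_rows`, `attention_matrix`, `loss`, `fd_hessian`, `hessian_blocks`). The paper derives the Hessian analytically; finite differences are swapped in here only to keep the example short.

```python
import numpy as np

def softmax_rows(Z):
    """Numerically stable row-wise softmax."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def attention_matrix(X, W_Q, W_K):
    """Scaled dot-product attention weights (assumed standard form)."""
    d_k = W_Q.shape[1]
    return softmax_rows(X @ W_Q @ W_K.T @ X.T / np.sqrt(d_k))

def loss(X, Y, W_Q, W_K, W_V):
    """Squared-error loss of one self-attention layer (toy assumption)."""
    F = attention_matrix(X, W_Q, W_K) @ X @ W_V
    return 0.5 * np.sum((F - Y) ** 2)

def fd_hessian(f, w, eps=1e-3):
    """Central finite-difference Hessian of a scalar function f at flat w."""
    n = w.size
    H = np.zeros((n, n))
    E = np.eye(n) * eps  # row i is a step of size eps along coordinate i
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(w + E[i] + E[j]) - f(w + E[i] - E[j])
                       - f(w - E[i] + E[j]) + f(w - E[i] - E[j])) / (4 * eps ** 2)
    return H

# Hypothetical toy problem: 4 tokens, model width 3, key width 2.
rng = np.random.default_rng(0)
n_tok, d, d_k = 4, 3, 2
X = rng.normal(size=(n_tok, d))
Y = rng.normal(size=(n_tok, d))
W_Q = 0.5 * rng.normal(size=(d, d_k))
W_K = 0.5 * rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d))

def hessian_blocks(W_V_cur):
    """Query-query and value-value diagonal Hessian blocks at W_V_cur."""
    H_Q = fd_hessian(
        lambda w: loss(X, Y, w.reshape(d, d_k), W_K, W_V_cur), W_Q.ravel())
    H_V = fd_hessian(
        lambda w: loss(X, Y, W_Q, W_K, w.reshape(d, d)), W_V_cur.ravel())
    return H_Q, H_V

H_Q, H_V = hessian_blocks(W_V)
H_Q2, H_V2 = hessian_blocks(2.0 * W_V)  # same point, value weights doubled

# For this squared-error toy the loss is quadratic in W_V, so the value block
# has a closed form (row-major flattening): kron(M^T M, I) with M = A X.
M = attention_matrix(X, W_Q, W_K) @ X
H_V_closed = np.kron(M.T @ M, np.eye(d))

change_Q = np.abs(H_Q2 - H_Q).max()  # query block reacts to W_V
change_V = np.abs(H_V2 - H_V).max()  # value block does not (up to round-off)
print("max change in query block:", change_Q)
print("max change in value block:", change_V)
```

In this toy setting the value block depends on the data and the attention matrix but not on W_V itself, while the query block also changes when W_V changes. That is one small, concrete instance of Hessian blocks depending on data and weights in different ways across parameters; it is not a reproduction of the paper's analysis.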

arXiv information

Authors	Weronika Ormaniec, Felix Dangel, Sidak Pal Singh
Published	2025-03-17 17:32:06+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
