White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

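Summary

This paper argues that a natural objective of representation learning is to compress and transform the distribution of the data, for example sets of tokens, into a low-dimensional Gaussian mixture supported on incoherent subspaces.
The quality of such a representation can be evaluated with a principled measure called sparse rate reduction, which simultaneously maximizes the intrinsic information gain and the extrinsic sparsity of the learned representation.
From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes that optimize this measure.
In particular, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features.
This leads to a family of white-box, transformer-like deep network architectures, named CRATE, that are mathematically fully interpretable.
Through a novel connection between denoising and compression, we show that the inverse of this compressive encoding can be realized by the same class of CRATE architectures.
The white-box architectures derived in this way are therefore universal to both encoders and decoders.
Experiments show that, despite their simplicity, these networks do learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models (ViT, MAE, DINO, BERT, and GPT2).
We believe the proposed computational framework shows great potential for bridging the gap between the theory and practice of deep learning from a unified perspective of data compression.
Code is available at https://ma-lab-berkeley.github.io/CRATE.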

Abstract (original)

In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
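
As a rough illustration of the "compress, then sparsify" alternation described in the abstract, here is a minimal NumPy sketch. It is not the CRATE implementation. It assumes the standard lossy coding rate from the rate-reduction literature, R(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T), takes an exact gradient step on it instead of the paper's approximate multi-head self-attention step, and uses plain soft-thresholding as a stand-in for the sparsifying multi-layer perceptron. The function names, array sizes, and the values of `eps`, `step`, and `lam` are all hypothetical.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (N * eps^2) * Z Z^T) for Z of shape (d, N)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    # slogdet is numerically safer than log(det(...)).
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet


def compress_step(Z, eps=0.5, step=0.1):
    """One exact gradient descent step on R(Z) (assumed stand-in for the attention step).

    The gradient of R with respect to Z is alpha * (I + alpha Z Z^T)^{-1} Z.
    """
    d, N = Z.shape
    alpha = d / (N * eps**2)
    grad = alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)
    return Z - step * grad


def sparsify_step(Z, lam=0.1):
    """Soft-thresholding (assumed stand-in for the MLP): small entries become exactly zero."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)


def toy_block(Z, eps=0.5, step=0.1, lam=0.1):
    """Compress, then sparsify: the two alternating steps in toy form."""
    return sparsify_step(compress_step(Z, eps, step), lam)


rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 64))  # 8-dimensional features for 64 tokens (hypothetical sizes)
Z_next = toy_block(Z)
print("coding rate before:", coding_rate(Z))
print("coding rate after compression step:", coding_rate(compress_step(Z)))
print("fraction of zero entries after the block:", np.mean(Z_next == 0.0))
```

With a small step size, the compression step shrinks every singular value of `Z`, so the coding rate goes down, while the thresholding step sets small entries exactly to zero. Stacking `toy_block` several times gives a crude picture of how a deep network could iterate on such an objective; the actual architecture compresses against learned subspaces rather than shrinking all tokens uniformly.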

arXiv information

Authors	Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
Published	2023-11-24 09:18:44+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
