MetaFormer Baselines for Vision

Summary

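MetaFormer, the abstracted architecture of the Transformer, has been found to play a significant role in achieving competitive performance.
This paper explores the capacity of MetaFormer further, again without focusing on token mixer design. It introduces several baseline models under MetaFormer that use the most basic or common mixers, and summarizes the observations as follows.
(1) MetaFormer ensures a solid lower bound of performance.
By merely adopting identity mapping as the token mixer, the MetaFormer model termed IdentityFormer achieves more than 80% accuracy on ImageNet-1K.
(2) MetaFormer works well with arbitrary token mixers.
Even when the token mixer is a random matrix, the resulting model, RandFormer, reaches more than 81% accuracy and outperforms IdentityFormer.
MetaFormer's results can therefore be relied on when new token mixers are adopted.
(3) MetaFormer effortlessly offers state-of-the-art results.
With only conventional token mixers dating back five years, models instantiated from MetaFormer already beat the state of the art.
(a) ConvFormer outperforms ConvNeXt.
Using common depthwise separable convolutions as the token mixer, ConvFormer, which can be regarded as a pure CNN, outperforms the strong CNN model ConvNeXt.
(b) CAFormer sets a new record on ImageNet-1K.
By simply applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, CAFormer achieves 85.5% accuracy at 224×224 resolution under normal supervised training, without external data or distillation.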
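
To make the "pluggable token mixer" idea concrete, here is a minimal sketch in plain Python of a residual token-mixing sub-block with an identity mixer (as in IdentityFormer) and a fixed random-matrix mixer (as in RandFormer). The function names, the row normalisation of the random matrix, and the omission of normalisation layers and the channel MLP are all assumptions made for illustration; none of them come from the abstract.

```python
import random


def identity_mixer(tokens):
    """IdentityFormer-style mixer: tokens pass through with no mixing."""
    return [row[:] for row in tokens]


def make_random_mixer(num_tokens, seed=0):
    """RandFormer-style mixer: a fixed, untrained random matrix mixes tokens.

    Assumption: each row is normalised to sum to 1 so the output scale
    stays comparable to the input. The abstract does not specify this.
    """
    rng = random.Random(seed)
    weights = []
    for _ in range(num_tokens):
        row = [rng.random() for _ in range(num_tokens)]
        total = sum(row)
        weights.append([w / total for w in row])

    def mixer(tokens):
        channels = len(tokens[0])
        # Output token i is a weighted sum over all input tokens j.
        return [
            [sum(weights[i][j] * tokens[j][c] for j in range(num_tokens))
             for c in range(channels)]
            for i in range(num_tokens)
        ]

    return mixer


def token_mixing_sub_block(tokens, mixer):
    """Residual sub-block: output = tokens + mixer(tokens).

    Normalisation and the channel MLP sub-block are omitted for brevity.
    """
    mixed = mixer(tokens)
    return [[x + m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(tokens, mixed)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, 2 channels
identity_out = token_mixing_sub_block(tokens, identity_mixer)
random_out = token_mixing_sub_block(tokens, make_random_mixer(len(tokens)))
```

Swapping `identity_mixer` for the random mixer changes only the argument passed to `token_mixing_sub_block`, which is the sense in which the surrounding architecture is independent of the mixer.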
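In the course of probing MetaFormer, the authors also find that a new activation, StarReLU, reduces the FLOPs of the activation by 71% compared with GELU while achieving better performance.
They expect StarReLU to find great potential in MetaFormer-like models as well as other neural networks.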
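
The abstract does not give the StarReLU formula. The sketch below assumes the definition from the full paper, a squared ReLU with a scale and a bias, `s * relu(x)**2 + b`. The default constants (0.8944 and -0.4472) are the initial values reported there for parameters that are learnable in practice; treat both the formula and the constants as assumptions in the context of this page. An exact GELU is included only for side-by-side comparison.

```python
import math


def relu(x):
    return max(0.0, x)


def star_relu(x, scale=0.8944, bias=-0.4472):
    """StarReLU sketch: scale * relu(x)^2 + bias.

    Assumption: scale and bias are fixed here; in the paper they are
    learnable scalars initialised to these values.
    """
    return scale * relu(x) ** 2 + bias


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-2.0, 0.0, 1.0, 2.0):
    print(value, star_relu(value), gelu(value))
```

Under this definition StarReLU needs only a comparison, a square, a multiply and an add per element, while GELU evaluates an error function, which is consistent with the lower activation cost reported in the summary.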

Summary (original)

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. When specifying the token mixer as even a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer’s results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dated back five years ago, the models instantiated from MetaFormer already beat state of the art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as pure CNNs, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets new record on ImageNet-1K. By simply applying depthwise separable convolutions as token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model CAFormer sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224×224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces 71% FLOPs of activation compared with GELU yet achieves better performance. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.

arXiv information

Authors	Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, Xinchao Wang
Published	2022-12-22 17:56:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
