Rethinking Hierarchies in Pre-trained Plain Vision Transformer

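Summary

Self-supervised pre-training of vision transformers (ViTs) with masked image modeling (MIM) has proven highly effective. Hierarchical ViTs, however, cannot use the simple vanilla MAE and instead need carefully designed, customized algorithms such as GreenMIM. More importantly, hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of plain ViTs, so they must be pre-trained themselves at massive computational cost, which adds both algorithmic and computational complexity. In this paper, we address the problem by decoupling hierarchical architecture design from self-supervised pre-training. We convert a plain ViT into a hierarchical one with minimal changes: we change the stride of the linear embedding layer from 16 to 4 and add convolutional (or simple average) pooling layers between the transformer blocks, which reduces the feature size sequentially from 1/4 to 1/32. Despite its simplicity, the resulting model outperforms the plain ViT baseline on classification, detection, and segmentation across the ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks. We hope this preliminary study draws more community attention to developing effective (hierarchical) ViTs that leverage off-the-shelf checkpoints and avoid the pre-training cost. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.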
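The resolution schedule described above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the HPViT code: the function names, the 224-pixel input size, and the choice of three 2x pooling steps are assumptions picked so that the feature size goes from 1/4 to 1/32 as the summary states. The average pooling helper stands in for the "simple average" variant; the convolutional variant is not shown.

```python
def stage_resolutions(image_size=224, patch_stride=4, pool_factors=(2, 2, 2)):
    """Return (grid side length, cumulative stride) for each stage.

    Assumption: each pooling layer downsamples by an integer factor and the
    image size is divisible by every cumulative stride.
    """
    stride = patch_stride
    stages = [(image_size // stride, stride)]
    for factor in pool_factors:
        stride *= factor
        stages.append((image_size // stride, stride))
    return stages


def avg_pool_2x2(grid):
    """Average non-overlapping 2x2 windows of a square grid (list of lists)."""
    n = len(grid) // 2
    return [
        [
            (grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Plain ViT: one stage at stride 16, no pooling.
    print("plain:", stage_resolutions(patch_stride=16, pool_factors=()))
    # Hierarchical variant: stride 4 embedding, then three assumed 2x poolings.
    for side, stride in stage_resolutions():
        print(f"1/{stride} resolution: {side}x{side} = {side * side} tokens")
```

With these assumed settings, the stride-4 embedding yields a 56x56 token grid, sixteen times as many tokens as the 14x14 grid of a stride-16 plain ViT, and each pooling step then quarters the token count.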

Abstract (original)

Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective. However, customized algorithms should be carefully designed for the hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, the requirement of pre-training them leads to a massive amount of computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks, respectively. We hope this preliminary study could draw more attention from the community on developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging the off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.

arXiv information

Authors	Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
Published	2022-11-08 15:07:29+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

