MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

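Summary

In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple and efficient masked image modeling (MIM) method that can be applied to a variety of hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special MASK symbol and aim to recover the original image tokens from the corrupted image. However, we find that because the masking ratio is large (e.g., 40% in BEiT), using the MASK symbol slows training considerably and introduces an inconsistency between pretraining and fine-tuning. MixMIM instead replaces the masked tokens of one image with the visible tokens of another image, which produces a mixed image. It then performs dual reconstruction, recovering both original images from the mixed input, which significantly improves efficiency. Although MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer and scales it to MixMIM-B, -L, and -H. Empirical results show that MixMIM learns high-quality visual representations efficiently. Notably, MixMIM-B, with 88M parameters, reaches 85.1% top-1 accuracy on ImageNet-1K after 600 epochs of pretraining, a new record among MIM methods for neural networks of comparable model size (e.g., ViT-B). In addition, its transfer performance on 6 other datasets shows that MixMIM offers a better FLOPs/performance trade-off than previous MIM methods. Code is available at https://github.com/Sense-X/MixMIM.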
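The mixing and dual-reconstruction steps described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the linked repository for that). The function names `mix_tokens` and `dual_reconstruction_targets`, the 0.5 mask ratio, the string tokens, and the choice to reconstruct each image only at the positions where it is hidden are all assumptions made for illustration; the summary does not specify them.

```python
import random


def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, seed=0):
    """Build a mixed token sequence from two images' token lists.

    mask[i] is True where image A's token is masked out; those positions
    are filled with image B's token from the same position.
    mask_ratio=0.5 is an illustrative assumption, not a value from the text.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both images must have the same number of tokens")
    n = len(tokens_a)
    rng = random.Random(seed)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    mask = [i in masked for i in range(n)]
    mixed = [tokens_b[i] if mask[i] else tokens_a[i] for i in range(n)]
    return mixed, mask


def dual_reconstruction_targets(tokens_a, tokens_b, mask):
    """Return, per image, the tokens that are hidden in the mixed input.

    Assumption: each image is reconstructed only where it is not visible.
    """
    target_a = {i: tokens_a[i] for i, m in enumerate(mask) if m}
    target_b = {i: tokens_b[i] for i, m in enumerate(mask) if not m}
    return target_a, target_b


# Toy example: 8 tokens per image, represented as labelled strings.
tokens_a = [f"a{i}" for i in range(8)]
tokens_b = [f"b{i}" for i in range(8)]
mixed, mask = mix_tokens(tokens_a, tokens_b)
target_a, target_b = dual_reconstruction_targets(tokens_a, tokens_b, mask)
print(mixed)
print(sorted(target_a), sorted(target_b))
```

In this sketch every position of the mixed input carries a real image token, so no MASK symbol appears in the encoder input, and the two target sets together cover every position exactly once.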

Summary (original)

In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special MASK symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the MASK symbol greatly slows down the training and causes training-finetuning inconsistency, due to the large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the original two images from the mixed input, which significantly improves efficiency. While MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer, and scales with MixMIM-B, -L, and -H. Empirical results demonstrate that MixMIM can learn high-quality visual representations efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs, setting a new record for neural networks with comparable model sizes (e.g., ViT-B) among MIM methods. Besides, its transferring performances on the other 6 datasets show MixMIM has better FLOPs / performance tradeoff than previous MIM methods. Code is available at https://github.com/Sense-X/MixMIM.

arXiv information

Authors	Jihao Liu, Xin Huang, Osamu Yoshie, Yu Liu, Hongsheng Li
Published	2022-09-06 13:34:16+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
