The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

Abstract

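Masked autoencoders (MAE) have recently proven successful for self-supervised visual representation learning.
Previous work mainly applied hand-designed masking (e.g., random or block-wise) or teacher-guided masking and targets (e.g., from CLIP).
However, these approaches overlook the potential role of the self-trained (student) model in giving feedback to the teacher on masking and targets.
This work introduces CMT-MAE, which integrates Collaborative Masking and Targets to boost masked autoencoders.
Specifically, CMT-MAE uses a simple collaborative masking mechanism based on linear aggregation of the attention maps from both the teacher and student models.
It further proposes using the output features of these two models as the collaborative target for the decoder.
Pre-trained on ImageNet-1K, this simple and effective framework achieves state-of-the-art linear probing and fine-tuning performance.
In particular, with ViT-base, it improves the fine-tuning result of the vanilla MAE from 83.6% to 85.7%.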
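
The collaborative masking step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `collaborative_mask`, the mixing weight `alpha`, the 75% default mask ratio, and the choice to mask the highest-scoring patches are all assumptions made for illustration. The abstract only states that the teacher and student attentions are linearly aggregated.

```python
def collaborative_mask(teacher_attn, student_attn, alpha=0.5, mask_ratio=0.75):
    """Return a 0/1 list per patch (1 = masked) from two per-patch attention scores.

    alpha and mask_ratio are hypothetical parameters, not values from the paper.
    """
    if len(teacher_attn) != len(student_attn):
        raise ValueError("attention vectors must have the same length")

    # Linear aggregation of the teacher and student attention scores.
    combined = [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(teacher_attn, student_attn)
    ]

    num_masked = int(round(mask_ratio * len(combined)))

    # Assumption: mask the patches with the highest combined attention.
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    masked = set(ranked[:num_masked])
    return [1 if i in masked else 0 for i in range(len(combined))]


# Toy example with four patches.
teacher = [0.4, 0.1, 0.3, 0.2]
student = [0.2, 0.1, 0.4, 0.3]
mask = collaborative_mask(teacher, student, alpha=0.5, mask_ratio=0.5)
print(mask)
```

Setting `alpha=1.0` reduces this to purely teacher-guided masking, while `alpha=0.0` lets the student alone decide, which makes the role of the student's feedback easy to see on toy inputs.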

Abstract (original)

Masked autoencoders (MAE) have recently succeeded in self-supervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher (e.g., CLIP)-guided masking and targets. However, they ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we present to integrate Collaborative Masking and Targets for boosting Masked AutoEncoders, namely CMT-MAE. Specifically, CMT-MAE leverages a simple collaborative masking mechanism through linear aggregation across attentions from both teacher and student models. We further propose using the output features from those two models as the collaborative target of the decoder. Our simple and effective framework pre-trained on ImageNet-1K achieves state-of-the-art linear probing and fine-tuning performance. In particular, using ViT-base, we improve the fine-tuning results of the vanilla MAE from 83.6% to 85.7%.
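
The collaborative target mentioned in the abstract can be sketched in the same spirit. The abstract says only that output features from the teacher and student serve as the decoder's target. The weighted sum with weight `beta`, the mean-squared-error loss, and the restriction of the loss to masked patches are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def collaborative_target(teacher_feats, student_feats, beta=0.5):
    """Blend per-patch feature vectors from the two models.

    Assumption: the target is a weighted sum; beta is a hypothetical weight.
    """
    return [
        [beta * t + (1.0 - beta) * s for t, s in zip(t_vec, s_vec)]
        for t_vec, s_vec in zip(teacher_feats, student_feats)
    ]


def masked_mse(predictions, targets, mask):
    """Mean squared error over masked patches only (mask[i] == 1)."""
    total, count = 0.0, 0
    for pred, tgt, m in zip(predictions, targets, mask):
        if m:
            total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
            count += len(pred)
    return total / count if count else 0.0


# Toy example: two patches with two-dimensional features.
teacher_feats = [[1.0, 0.0], [0.0, 2.0]]
student_feats = [[0.0, 1.0], [2.0, 0.0]]
target = collaborative_target(teacher_feats, student_feats, beta=0.5)

decoder_out = [[0.5, 0.5], [0.0, 0.0]]  # stand-in for decoder predictions
loss = masked_mse(decoder_out, target, mask=[0, 1])
print(target, loss)
```

Only the second patch is masked in this example, so the loss measures how far the decoder's prediction for that patch is from the blended teacher and student features.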

arXiv information

Author	Shentong Mo
Published	2024-12-23 13:37:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
