CAE v2: Context Autoencoder with CLIP Target

Summary

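Masked image modeling (MIM) learns visual representations by masking image patches and reconstructing them.
Applying the reconstruction supervision to CLIP representations has proven effective for MIM.
However, how CLIP supervision in MIM affects performance remains under-explored.
To investigate strategies for refining CLIP-targeted MIM, we study two critical elements of MIM, the supervision position and the mask ratio, and report two findings, relying on a simple pipeline we developed, the context autodecoder with CLIP target (CAE v2).
First, supervision on visible patches achieves remarkable performance, even better than supervision on masked patches, which is the standard format in existing MIM methods.
Second, the optimal mask ratio correlates positively with model size.
That is, the smaller the model, the lower the mask ratio needs to be.
Driven by these two findings, our simple and concise approach, CAE v2, achieves superior performance on a series of downstream tasks.
For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1 accuracy with linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K, after 300 epochs of pre-training.
We hope our findings can serve as helpful guidelines for pre-training in the MIM area, especially for small-scale models.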
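
The abstract names two knobs, the supervision position and the mask ratio, without giving implementation details. The following is only a minimal sketch of what those two knobs mean, not the authors' pipeline. The function names, the cosine-distance loss, the 196-patch grid, the feature size, and the mask ratio of 0.5 are all assumptions made for illustration, and the random vectors merely stand in for CLIP target features and model predictions.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) lists at the given mask ratio."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def cosine_distance(u, v):
    """1 - cosine similarity; an assumed loss, not specified in the abstract."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def supervision_loss(predictions, targets, positions):
    """Average the per-patch loss over the chosen supervision positions only."""
    total = sum(cosine_distance(predictions[i], targets[i]) for i in positions)
    return total / len(positions)


rng = random.Random(0)
num_patches, dim = 196, 8  # hypothetical: 14 x 14 patches, tiny feature size

# Placeholder features standing in for CLIP targets and model predictions.
targets = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
predictions = [[t + rng.gauss(0, 0.1) for t in row] for row in targets]

# The mask ratio controls how many patches are hidden from the encoder.
visible, masked = random_mask(num_patches, mask_ratio=0.5, rng=rng)

# The supervision position selects which patches contribute to the loss.
loss_on_visible = supervision_loss(predictions, targets, visible)
loss_on_masked = supervision_loss(predictions, targets, masked)
```

In this sketch, changing the supervision position amounts to passing `visible` rather than `masked` to `supervision_loss`, and changing the mask ratio only alters how the indices are split.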

Summary (original)

Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, it is still under-explored how CLIP supervision in MIM influences performance. To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio, and reveal two interesting perspectives, relying on our developed simple pipeline, context autodecoder with CLIP target (CAE v2). Firstly, we observe that the supervision on visible patches achieves remarkable performance, even better than that on masked patches, where the latter is the standard format in the existing MIM methods. Secondly, the optimal mask ratio positively correlates to the model size. That is to say, the smaller the model, the lower the mask ratio needs to be. Driven by these two discoveries, our simple and concise approach CAE v2 achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1 accuracy on linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K with the pre-training for 300 epochs. We hope our findings can be helpful guidelines for the pre-training in the MIM area, especially for the small-scale models.

arXiv information

Authors	Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
Published	2022-11-17 18:58:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
