Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

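Summary

Self-supervised pre-training of image encoders is widespread in the literature, particularly since the introduction of masked autoencoders (MAE).
Current work attempts to learn object-centric representations from motion in videos.
In particular, SiamMAE recently introduced a Siamese network that trains a shared-weight encoder on two frames of a video with a high asymmetric masking ratio (95%).
In this work, we propose CropMAE, an alternative to the Siamese pre-training introduced by SiamMAE.
Our method differs in that it considers only pairs of crops taken from the same image but cropped differently, instead of the conventional pairs of frames extracted from a video.
CropMAE therefore removes the need for video datasets while maintaining competitive performance and substantially reducing pre-training time.
Furthermore, we show that CropMAE learns similar object-centric representations without explicit motion, which indicates that current self-supervised learning methods learn objects not from motion but thanks to the Siamese architecture.
Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling images to be reconstructed from only two visible patches.
Our code is available at https://github.com/alexandre-eymael/CropMAE.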
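The "two visible patches" figure follows from simple arithmetic. The sketch below assumes a 224×224 input split into 16×16 patches (196 patches, a common ViT setting that the summary does not state) and assumes the visible count is rounded down, as in MAE-style random masking: 196 × 1.5% ≈ 2.94, which rounds down to 2. The helper name `visible_patches` is hypothetical.

```python
def visible_patches(image_size: int, patch_size: int, mask_ratio: float) -> int:
    """Number of patches left visible after masking (hypothetical helper)."""
    num_patches = (image_size // patch_size) ** 2  # 224 // 16 = 14, so 14 * 14 = 196
    # Assumption: the visible count is rounded down, as in MAE-style random masking.
    return int(num_patches * (1 - mask_ratio))


print(visible_patches(224, 16, 0.95))   # SiamMAE's 95%: prints 9
print(visible_patches(224, 16, 0.985))  # CropMAE's 98.5%: prints 2
```

Under the same assumptions, MAE's usual 75% ratio would leave 49 of the 196 patches visible, so each of the two ratios above gives the encoder far less direct evidence about the masked view.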

Abstract (original)

Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn objects from motion, but rather thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.
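To make the difference from SiamMAE concrete, here is a minimal sketch of how one CropMAE-style training pair could be assembled from a single image instead of two video frames. It is illustrative only: the toy nested-list image, the fixed-size crops without resizing, the choice to leave the first view fully visible and mask only the second (mirroring SiamMAE's asymmetric scheme), and all function names are assumptions, not the authors' code.

```python
import random
from typing import List, Tuple

# Hypothetical toy representation: an H x W grayscale image as nested lists.
Image = List[List[float]]


def random_crop(img: Image, size: int, rng: random.Random) -> Image:
    """Cut a size x size window at a random position (no resizing, a simplification)."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def visible_indices(num_patches: int, mask_ratio: float, rng: random.Random) -> List[int]:
    """Randomly choose which patches stay visible; all others are masked."""
    num_keep = int(num_patches * (1 - mask_ratio))  # round down, MAE-style (assumption)
    return sorted(rng.sample(range(num_patches), num_keep))


def make_crop_pair(img: Image, crop_size: int, patch_size: int,
                   mask_ratio: float, seed: int = 0) -> Tuple[Image, Image, List[int]]:
    """Two crops of the SAME image: view_a fully visible, view_b heavily masked (assumed scheme)."""
    rng = random.Random(seed)
    view_a = random_crop(img, crop_size, rng)  # context view, no masking
    view_b = random_crop(img, crop_size, rng)  # view whose masked patches are reconstructed
    num_patches = (crop_size // patch_size) ** 2
    return view_a, view_b, visible_indices(num_patches, mask_ratio, rng)


# Usage: a 320 x 320 toy image whose pixel value encodes its (row, col) position.
img = [[float(r * 320 + c) for c in range(320)] for r in range(320)]
a, b, visible = make_crop_pair(img, crop_size=224, patch_size=16, mask_ratio=0.985)
print(len(a), len(a[0]), len(b), len(visible))  # prints: 224 224 224 2
```

In a SiamMAE-style model, both views would then be encoded by the same shared-weight encoder, and a decoder would reconstruct the masked patches of the second view using the first view as context. The only source of "motion" here is the offset between the two crops, which is the point the abstract makes.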

arXiv information

Authors	Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck
Published	2024-03-26 16:04:19+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
