Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

Summary

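Masked image modeling (MIM) has emerged as a promising way to learn visual representations from unlabeled images by predicting the missing pixels of masked image regions.
It excels at region-aware learning and provides strong initializations for a variety of tasks, but it struggles to capture high-level semantics without further supervised fine-tuning, likely because its pixel-reconstruction objective is low-level.
A promising but so far unrealized framework is to learn representations through masked reconstruction in latent space, combining the locality of MIM with high-level targets.
However, because the reconstruction targets are learned jointly with the model, this approach poses significant training challenges and can lead to trivial or suboptimal solutions. Our study is among the first to thoroughly analyze and address the challenges of such a framework, which we call Latent MIM.
Through a series of carefully designed experiments and extensive analysis, we identify the sources of these challenges: representation collapse under joint online/target optimization, the choice of learning objective, high correlation between regions in latent space, and decoding conditioning.
By addressing these issues one by one, we show that Latent MIM can indeed learn high-level representations while retaining the benefits of MIM models.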
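
To make the idea of masked reconstruction in latent space more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the loss, the masking ratio, or how the target encoder is updated. The cosine-based loss, the 75% mask ratio in the usage lines, the momentum (EMA) update, and all function names (`random_mask`, `latent_reconstruction_loss`, `ema_update`) are assumptions chosen only for illustration.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the online encoder.

    Uniform random masking is an assumption; the abstract does not
    say how patches are selected.
    """
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def cosine_similarity(u, v):
    """Cosine similarity between two latent vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def latent_reconstruction_loss(predicted, targets, masked_idx):
    """Mean (1 - cosine similarity) over the masked patches only.

    `predicted` stands for the decoder's latent predictions and `targets`
    for the target encoder's latents of the full image. The cosine
    objective is an illustrative choice, not the paper's stated loss.
    """
    losses = [1.0 - cosine_similarity(predicted[i], targets[i])
              for i in masked_idx]
    return sum(losses) / len(losses)


def ema_update(target_params, online_params, momentum=0.99):
    """Move target parameters slowly toward the online parameters.

    A momentum target is one common way to discourage collapse when
    targets are learned jointly with the model; it is assumed here.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    rng = random.Random(0)
    masked = random_mask(num_patches=16, mask_ratio=0.75, rng=rng)
    # Toy latents: 16 patches, 4 dimensions each.
    targets = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
    predicted = [[x + rng.gauss(0, 0.1) for x in row] for row in targets]
    print("masked patches:", masked)
    print("loss:", latent_reconstruction_loss(predicted, targets, masked))
```

Because the loss is computed only at masked positions and against latents rather than pixels, a degenerate solution exists in which both encoders output the same constant vector everywhere. That is the representation collapse the summary refers to, and it is why some mechanism such as the assumed slow-moving target is needed.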

Summary (original)

Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images. It excels in region-aware learning and provides strong initializations for various tasks, but struggles to capture high-level semantics without further supervised fine-tuning, likely due to the low-level nature of its pixel reconstruction objective. A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with the high-level targets. However, this approach poses significant training challenges as the reconstruction targets are learned in conjunction with the model, potentially leading to trivial or suboptimal solutions. Our study is among the first to thoroughly analyze and address the challenges of such framework, which we refer to as Latent MIM. Through a series of carefully designed experiments and extensive analysis, we identify the source of these challenges, including representation collapsing for joint online/target optimization, learning objectives, the high region correlation in latent space and decoding conditioning. By sequentially addressing these issues, we demonstrate that Latent MIM can indeed learn high-level representations while retaining the benefits of MIM models.

arXiv information

Authors	Yibing Wei, Abhinav Gupta, Pedro Morgado
Published	2024-07-22 17:54:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
