Masked Visual Reconstruction in Language Semantic Space

Summary

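Both masked image modeling (MIM) and natural language supervision have advanced transferable visual pre-training.
In this work, we seek the synergy between the two paradigms and study the properties that emerge when MIM meets natural language supervision.
To this end, we present a novel pre-training framework, masked visual Reconstruction In Language semantic Space (RILS), in which sentence representations encoded by the text encoder serve as prototypes that transform vision-only signals into patch-sentence probabilities, which act as semantically meaningful MIM reconstruction targets.
The vision model can therefore capture useful components with structured information by predicting the proper semantics of masked tokens.
Better visual representations could, in turn, improve the text encoder through the image-text alignment objective, which is essential for an effective MIM target transformation.
Extensive experiments show that our method not only combines the strengths of previous MIM and CLIP approaches but also achieves further improvements on various tasks thanks to their mutual benefits.
RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially in low-shot regimes.
Code will be made available at https://github.com/hustvl/RILS.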
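
The idea of using sentence representations as prototypes can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarities are computed or which loss is used, so the cosine similarity, the softmax temperature, the cross-entropy loss, and all function names below (`patch_sentence_probs`, `masked_reconstruction_loss`) are assumptions chosen for illustration. The sketch turns each patch feature into a probability distribution over a batch of sentence embeddings, then compares the distribution predicted for masked patches against a target distribution.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def patch_sentence_probs(patch_feats, sentence_protos, temperature=0.1):
    # ASSUMPTION: cosine similarity and a temperature-scaled softmax;
    # the abstract only says prototypes yield patch-sentence probabilities.
    protos = [l2_normalize(p) for p in sentence_protos]
    probs = []
    for feat in patch_feats:
        f = l2_normalize(feat)
        logits = [sum(a * b for a, b in zip(f, p)) / temperature for p in protos]
        probs.append(softmax(logits))
    return probs

def masked_reconstruction_loss(target_probs, predicted_probs, masked_idx):
    # ASSUMPTION: mean cross-entropy over masked patch positions only.
    eps = 1e-12
    losses = []
    for i in masked_idx:
        ce = -sum(t * math.log(q + eps)
                  for t, q in zip(target_probs[i], predicted_probs[i]))
        losses.append(ce)
    return sum(losses) / len(losses)

# Toy data: two sentence prototypes and three patches in a 2-D feature space.
sentence_protos = [[1.0, 0.0], [0.0, 1.0]]
target_feats = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]     # features of the full image
predicted_feats = [[0.8, 0.2], [0.5, 0.5], [0.6, 0.4]]  # predictions from the masked image

targets = patch_sentence_probs(target_feats, sentence_protos)
preds = patch_sentence_probs(predicted_feats, sentence_protos)
loss = masked_reconstruction_loss(targets, preds, masked_idx=[1, 2])
print(loss)
```

In this toy setup the targets live in the same space as the sentence embeddings, so each reconstruction target says which sentences a patch resembles rather than what its raw pixels look like.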

Abstract (original)

Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes to transform the vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets. The vision models can therefore capture useful components with structured information by predicting proper semantic of masked tokens. Better visual representations could, in turn, improve the text encoder via the image-text alignment objective, which is essential for the effective MIM target transformation. Extensive experimental results demonstrate that our method not only enjoys the best of previous MIM and CLIP but also achieves further improvements on various tasks due to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially for low-shot regimes. Code will be made available at https://github.com/hustvl/RILS.

arXiv information

Authors	Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, Xiaohu Qie, Xinggang Wang
Published	2023-01-17 15:32:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
