Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

Abstract

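The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities.
Existing methods try to optimize the input text prompts to improve image-text alignment, fine-tuning text prototypes to facilitate semantic matching.
Nevertheless, given the modality gap between the text and vision spaces, the text prototypes used by these methods do not effectively establish a close correspondence with pixel-level vision features.
In this work, a theoretical analysis indicates that the inherent modality gap causes misalignment between text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP.
To mitigate the impact of the modality gap, the authors propose a Vision Prototype Learning (VPL) framework that introduces more representative vision prototypes.
The core of the framework is to learn class-specific vision prototypes in the vision space, with the help of text prototypes, in order to capture high-quality localization maps.
The authors also propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes, leading to more comprehensive and robust feature learning.
Experimental results show that the proposed framework achieves state-of-the-art performance on two benchmark datasets.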

Abstract (original)

The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework, by introducing more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in vision space with the help of text prototypes, for capturing high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.

arXiv information

Authors	Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge Zhang, Jionglong Su, Zongyuan Ge
Published	2024-12-27 13:55:11+00:00

Source and services used

arxiv.jp, Google
