Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

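Summary

Recently, learning open-vocabulary semantic segmentation from text supervision has achieved promising downstream performance.
Nevertheless, because dense annotations are unavailable, current approaches face an alignment granularity gap: they learn coarse image/region-text alignment during training but make group/pixel-level predictions at inference.
This mismatch leads to suboptimal learning efficiency and inferior zero-shot segmentation results.
This paper introduces the Multi-Grained Cross-modal Alignment (MGCA) framework, which explicitly learns pixel-level alignment alongside object-level and region-level alignment, bridging the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences from image-text pairs and combines them with hard sampling strategies to facilitate fine-grained cross-modal contrastive learning.
In addition, the authors point out the defects of the existing group and pixel prediction units in downstream segmentation and develop an adaptive semantic unit that effectively mitigates their dilemmas, including under-segmentation and over-segmentation.
Trained solely on CC3M, the method achieves significant improvements over state-of-the-art methods, demonstrating its effectiveness and efficiency.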

Summary (original)

Recently, learning open-vocabulary semantic segmentation from text supervision has achieved promising downstream performance. Nevertheless, current approaches encounter an alignment granularity gap owing to the absence of dense annotations, wherein they learn coarse image/region-text alignment during training yet perform group/pixel-level predictions at inference. Such discrepancy leads to suboptimal learning efficiency and inferior zero-shot segmentation results. In this paper, we introduce a Multi-Grained Cross-modal Alignment (MGCA) framework, which explicitly learns pixel-level alignment along with object- and region-level alignment to bridge the granularity gap without any dense annotations. Specifically, MGCA ingeniously constructs pseudo multi-granular semantic correspondences upon image-text pairs and collaborates with hard sampling strategies to facilitate fine-grained cross-modal contrastive learning. Further, we point out the defects of existing group and pixel prediction units in downstream segmentation and develop an adaptive semantic unit which effectively mitigates their dilemmas including under- and over-segmentation. Training solely on CC3M, our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
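
The abstract refers to cross-modal contrastive learning between images and text without giving a formula. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over paired image and text embeddings. It is not the MGCA objective: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example, and the paper's pixel-, region-, and object-level correspondences and hard sampling are not modeled here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(image_embs, text_embs, temperature):
    """Build the image-by-text matrix of temperature-scaled cosine similarities."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the positive for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses.

    Assumes image_embs[k] and text_embs[k] form a matched pair (hypothetical setup).
    """
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(images, matched_texts))
    print(symmetric_info_nce(images, swapped_texts))
```

With the toy inputs, correctly paired embeddings give a loss close to zero, while swapping the text embeddings gives a much larger loss. A finer-grained variant would apply the same kind of loss to pixel or region features instead of whole-image embeddings, but how MGCA builds those pairs is not specified in this summary.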

arXiv information

Authors	Yajie Liu, Pu Ge, Qingjie Liu, Di Huang
Published	2024-03-06 13:43:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
