ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

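Summary

The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has made zero-shot semantic segmentation highly promising, because image-text aligned knowledge can be transferred to pixel-level classification.
However, existing methods usually require an additional image encoder or retraining/tuning of the CLIP module.
Here, we propose a novel method, Zero-shot segmentation with Optimal Transport (ZegOT), which matches multiple text prompts with frozen image embeddings through optimal transport.
In particular, we introduce a novel Multiple Prompt Optimal Transport Solver (MPOT), designed to learn an optimal mapping between the multiple text prompts and the visual feature maps from the hidden layers of the frozen image encoder.
This mapping helps each of the multiple text prompts focus effectively on distinct visual semantic attributes.
Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance, outperforming existing Zero-shot Semantic Segmentation (ZS3) approaches.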

Abstract (original)

Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport. In particular, we introduce a novel Multiple Prompt Optimal Transport Solver (MPOT), which is designed to learn an optimal mapping between multiple text prompts and visual feature maps of the frozen image encoder hidden layers. This unique mapping method facilitates each of the multiple text prompts to effectively focus on distinct visual semantic attributes. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance over existing Zero-shot Semantic Segmentation (ZS3) approaches.
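
To make the idea of matching text prompts to image features through optimal transport more concrete, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. It is not the authors' MPOT solver, whose details are not given in the abstract. The function names (`cosine_cost`, `sinkhorn`), the parameters (`eps`, `n_iters`), the uniform marginals, and the toy 2-D vectors standing in for CLIP text and pixel embeddings are all assumptions made for illustration.

```python
import math


def cosine_cost(prompts, pixels):
    """Cost matrix with cost[i][j] = 1 - cosine similarity(prompt i, pixel j)."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    cost = []
    for p in prompts:
        row = []
        for q in pixels:
            dot = sum(a * b for a, b in zip(p, q))
            row.append(1.0 - dot / (norm(p) * norm(q)))
        cost.append(row)
    return cost


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic optimal transport with uniform marginals (an assumption).

    Returns a transport plan whose entry [i][j] is the mass moved
    between prompt i and pixel j.
    """
    n, m = len(cost), len(cost[0])
    mu = [1.0 / n] * n  # mass assigned to each prompt
    nu = [1.0 / m] * m  # mass assigned to each pixel
    # Gibbs kernel: low cost gives a large kernel value.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy data (hypothetical): two prompt embeddings and four pixel embeddings.
prompts = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
plan = sinkhorn(cosine_cost(prompts, pixels))
for row in plan:
    print([round(x, 4) for x in row])
```

In this toy setup each row of the plan concentrates its mass on the pixels most similar to that prompt, which is the intuition behind letting different prompts attend to different visual attributes. A real pipeline would use frozen CLIP features and whatever marginals and cost the paper actually specifies.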

arXiv information

Authors	Kwanyoung Kim, Yujin Oh, Jong Chul Ye
Published	2023-05-30 13:46:57+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
