Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Summary

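We address open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations.
Existing open-world segmentation methods have made notable progress by using contrastive learning (CL) to learn diverse visual concepts and then adapting the learned image-level understanding to the segmentation task.
However, these CL-based methods suffer from a train-test discrepancy: they consider only image-text alignment during training, whereas the segmentation task requires region-text alignment at test time.
In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that addresses this discrepancy by directly aligning a text with the region it describes.
Our method generates a segmentation mask associated with a given text, extracts a grounded image embedding from the masked region, and aligns it with the text embedding via TCL.
By having the model learn region-text alignment rather than image-text alignment, the framework resolves the discrepancy and encourages the model to directly improve the quality of the generated segmentation masks.
In addition, for a rigorous and fair comparison, we present a unified evaluation protocol using 8 widely used semantic segmentation datasets.
TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets.
Code is available at https://github.com/kakaobrain/tcl.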
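
The three steps named in the summary (generate a text-grounded mask, pool a grounded image embedding from the masked region, align it with the text embedding contrastively) can be illustrated with a small NumPy sketch. This is not the authors' implementation; see the repository above for that. The function names, the sigmoid-over-cosine-similarity mask, the `scale` and `temperature` values, and the symmetric InfoNCE-style loss are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def text_grounded_mask(features, text_emb, scale=10.0):
    """Soft mask from pixel-text cosine similarity (illustrative assumption).

    features: (B, H, W, D) dense image features
    text_emb: (B, D) one text embedding per image
    returns:  (B, H, W) mask values in (0, 1)
    """
    sim = np.einsum("bhwd,bd->bhw", l2_normalize(features), l2_normalize(text_emb))
    return 1.0 / (1.0 + np.exp(-scale * sim))

def masked_pool(features, mask, eps=1e-8):
    """Mask-weighted average of pixel features: the grounded image embedding."""
    summed = (features * mask[..., None]).sum(axis=(1, 2))   # (B, D)
    return summed / (mask.sum(axis=(1, 2))[:, None] + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: the i-th region should match the i-th text."""
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    region_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_region = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (region_to_text + text_to_region)
```

In this sketch the loss is computed on embeddings pooled from the masked region rather than from the whole image, so lowering the loss requires the mask to cover pixels that match the text. That is the region-text alignment the summary contrasts with image-text alignment.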

Summary (original)

We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and adapting the learned image-level understanding to the segmentation task. However, these methods based on CL have a discrepancy since it only considers image-text level alignment in training time, while the segmentation task requires region-text level alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework to directly align a text and a region described by the text to address the train-test discrepancy. Our method generates a segmentation mask associated with a given text, extracts grounded image embedding from the masked region, and aligns it with text embedding via TCL. The framework addresses the discrepancy by letting the model learn region-text level alignment instead of image-text level alignment and encourages the model to directly improve the quality of generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with widely used 8 semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance with large margins in all datasets. Code is available at https://github.com/kakaobrain/tcl.

arXiv information

Authors	Junbum Cha, Jonghwan Mun, Byungseok Roh
Published	2022-12-01 18:59:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
