CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

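Summary

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations or image-text datasets.
Because mask labels are labor-intensive to produce, the number of categories in segmentation datasets is limited.
As a result, the open-vocabulary capability of pre-trained vision-language models (VLMs) degrades substantially after fine-tuning.
Without fine-tuning, however, VLMs trained under weak image-text supervision tend to predict suboptimal masks when some text queries refer to concepts that do not appear in the image.
To alleviate these problems, the authors introduce a new recurrent framework that progressively filters out irrelevant texts and improves mask quality without any training effort.
The recurrent unit is a two-stage segmenter built on a VLM with frozen weights.
The model therefore retains the VLM's broad vocabulary space while strengthening its segmentation capability.
Experimental results show that the method outperforms not only its training-free counterparts but also methods fine-tuned on millions of additional data samples, setting new state-of-the-art records on both zero-shot semantic segmentation and referring image segmentation tasks.
Specifically, it improves the current records on Pascal VOC, COCO Object, and Pascal Context by 28.8, 16.0, and 6.9 mIoU, respectively.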

Summary (original)

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions when there are text queries referring to non-existing concepts in the image. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a VLM with frozen weights. Thus, our model retains the VLM’s broad vocabulary space and strengthens its segmentation capability. Experimental results show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of additional data samples, and sets new state-of-the-art records for both zero-shot semantic and referring image segmentation tasks. Specifically, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
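The recurrent filtering idea described in the abstract can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the authors' implementation: the similarity table `SIM`, the one-concept-per-pixel `IMAGE`, the `segment` function, and the `threshold` value are all hypothetical stand-ins for a frozen VLM and a real image, and the paper's two-stage segmenter is collapsed into a single scoring step. What it does show is how removing one irrelevant query changes the masks of the remaining queries, so further queries can drop out at later steps.

```python
from typing import Dict, List, Tuple

# Hypothetical similarity table standing in for a frozen VLM:
# SIM[query][pixel_concept] -> similarity. All values are made up.
SIM: Dict[str, Dict[str, float]] = {
    "cat":   {"cat": 0.9, "grass": 0.1, "fur": 0.6, "cushion": 0.2},
    "dog":   {"cat": 0.5, "grass": 0.1, "fur": 0.7, "cushion": 0.3},
    "grass": {"cat": 0.1, "grass": 0.8, "fur": 0.1, "cushion": 0.1},
    "sofa":  {"cat": 0.1, "grass": 0.1, "fur": 0.2, "cushion": 0.4},
}

# Toy "image": one latent concept per pixel (an assumption for illustration).
IMAGE: List[str] = ["cat", "cat", "cat", "fur", "cushion",
                    "grass", "grass", "grass"]


def segment(image: List[str], queries: List[str]
            ) -> Tuple[Dict[str, List[int]], Dict[str, float]]:
    """Frozen toy segmenter: assign each pixel to its best-matching query,
    then score each query by the mean similarity over its own mask."""
    masks: Dict[str, List[int]] = {q: [] for q in queries}
    for idx, concept in enumerate(image):
        best = max(queries, key=lambda q: SIM[q][concept])
        masks[best].append(idx)
    scores: Dict[str, float] = {}
    for q, mask in masks.items():
        if mask:
            scores[q] = sum(SIM[q][image[i]] for i in mask) / len(mask)
        else:
            scores[q] = 0.0  # a query that wins no pixel gets the lowest score
    return masks, scores


def recurrent_filter(image: List[str], queries: List[str],
                     threshold: float = 0.55, max_steps: int = 10):
    """Repeatedly segment, drop low-scoring queries, and feed the survivors
    back in until the query set stops changing. No weights are updated."""
    queries = list(queries)
    history = [list(queries)]
    for _ in range(max_steps):
        if not queries:
            break
        _, scores = segment(image, queries)
        kept = [q for q in queries if scores[q] >= threshold]
        if kept == queries:
            break  # fixed point: every remaining query is supported
        queries = kept
        history.append(list(kept))
    # Recompute masks for the final query set so masks and queries agree.
    masks = segment(image, queries)[0] if queries else {}
    return queries, masks, history


if __name__ == "__main__":
    final, masks, history = recurrent_filter(
        IMAGE, ["cat", "dog", "grass", "sofa"])
    for step, qs in enumerate(history):
        print(step, qs)
    print(final, masks)
```

In this toy setup, "sofa" is dropped first; the pixel it had claimed then falls to "dog", which lowers the mean score of "dog" below the threshold and removes it at the next step. A single filtering pass would have kept "dog", which is the motivation for running the segmenter recurrently.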

arXiv information

Authors	Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
Published	2023-12-12 19:00:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
