Towards Training-free Open-world Segmentation via Image Prompt Foundation Models

Summary

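Computer vision has undergone a paradigm shift with the advent of foundation models, mirroring the transformative effect of large language models on natural language processing.
This paper explores open-world segmentation and introduces Image Prompt Segmentation (IPSeg), a new approach that builds on vision foundation models.
IPSeg follows a training-free paradigm based on image prompting.
Specifically, IPSeg uses a single image containing a subjective visual concept as a flexible prompt to query vision foundation models such as DINOv2 and Stable Diffusion.
The approach extracts robust features from the prompt image and the input image, then matches the input representations to the prompt representations through a novel feature interaction module to generate point prompts that highlight the target objects in the input image.
The generated point prompts then guide the Segment Anything Model to segment the target object in the input image.
The method stands out by eliminating the need for exhaustive training, which makes it a more efficient and scalable solution.
Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's effectiveness for flexible open-world segmentation with intuitive image prompts.
This work pioneers the use of foundation models for open-world understanding through visual concepts conveyed in images.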
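
The matching step described above (compare input-image features with prompt-image features, then emit point prompts) can be illustrated with a minimal sketch. This is not the paper's feature interaction module, whose details the abstract does not give. It substitutes plain cosine similarity with top-k selection. The function names, the toy 2-D feature vectors, and the patch size of 14 are all assumptions made for illustration. Feature extraction and the segmentation step are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def point_prompts(prompt_feats, input_feats, grid_w, patch_size, top_k=2):
    """Return (x, y) pixel centers of the top_k input patches closest to the prompt.

    Hypothetical helper, not the authors' module.
    prompt_feats: feature vectors for patches of the prompt image
    input_feats:  feature vectors for input-image patches, in row-major order
    grid_w:       number of patches per row in the input image
    patch_size:   patch side length in pixels
    """
    scored = []
    for idx, feat in enumerate(input_feats):
        # Score each input patch by its best match among the prompt patches.
        best = max(cosine(feat, p) for p in prompt_feats)
        scored.append((best, idx))
    # Highest similarity first; ties go to the lower patch index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    points = []
    for _, idx in scored[:top_k]:
        row, col = divmod(idx, grid_w)
        points.append((col * patch_size + patch_size // 2,
                       row * patch_size + patch_size // 2))
    return points


# Toy data (assumed values): a 2x2 patch grid with 2-D features.
prompt_feats = [[1.0, 0.0]]
input_feats = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0], [0.1, 0.9]]
points = point_prompts(prompt_feats, input_feats, grid_w=2, patch_size=14, top_k=2)
print(points)
```

In a full pipeline, the returned coordinates would serve as foreground point prompts for a promptable segmenter such as the Segment Anything Model. That call is left out here so as not to guess at its interface.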

Abstract (original)

The realm of computer vision has witnessed a paradigm shift with the advent of foundational models, mirroring the transformative influence of large language models in the domain of natural language processing. This paper delves into the exploration of open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundational models. IPSeg rests on the principle of a training-free paradigm, which capitalizes on image prompt techniques. Specifically, IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. Our approach extracts robust features for the prompt image and input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. The generated point prompts are further utilized to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg’s efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.

arXiv information

Authors	Lv Tang, Peng-Tao Jiang, Hao-Ke Xiao, Bo Li
Published	2023-12-18 00:37:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
