EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

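Summary

The Segment Anything Model (SAM) has attracted widespread attention for its strong interactive segmentation with visual prompts, but text prompts remain largely unexplored.
In this paper, we empirically investigate which text prompt encoders (e.g., CLIP or an LLM) are suited to adapting SAM for referring expression segmentation, and we introduce the Early Vision-language Fusion-based SAM (EVF-SAM).
EVF-SAM is a simple yet effective referring segmentation method that exploits multimodal prompts (i.e., image and text). It comprises a pre-trained vision-language model that generates referring prompts and a SAM model that performs the segmentation.
Surprisingly, we observe that (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are both beneficial for prompting SAM toward accurate referring segmentation.
Our experiments show that EVF-SAM based on BEIT-3 achieves state-of-the-art performance on RefCOCO/+/g for referring expression segmentation, demonstrating the advantage of prompting SAM with early vision-language fusion.
In addition, EVF-SAM, with 1.32B parameters, achieves remarkably higher performance while using nearly 82% fewer parameters than previous SAM methods based on large multimodal models.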
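
As a quick sanity check on the parameter claim, the two numbers quoted in the abstract (1.32B parameters, nearly 82% reduction) can be inverted to see what baseline size they imply. The abstract does not name the baseline model, so the result below is only what the stated figures imply arithmetically, not a reported number; the function name is a hypothetical helper introduced for illustration.

```python
# Back-of-the-envelope check of the parameter claim in the abstract.
# Assumption: "reducing nearly 82% of parameters" means EVF-SAM keeps
# roughly (1 - 0.82) of the baseline's parameter count.

EVF_SAM_PARAMS_B = 1.32  # billions of parameters, from the abstract
REDUCTION = 0.82         # "nearly 82%", from the abstract


def implied_baseline_params(small_b: float, reduction: float) -> float:
    """Baseline size (in billions) implied by a model that keeps
    (1 - reduction) of the baseline's parameters."""
    return small_b / (1.0 - reduction)


baseline_b = implied_baseline_params(EVF_SAM_PARAMS_B, REDUCTION)
print(f"implied baseline: about {baseline_b:.1f}B parameters")
# prints "implied baseline: about 7.3B parameters"
```

Because the abstract says "nearly" 82%, the implied baseline is approximate as well.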

Summary (original)

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
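
To make the idea of early vision-language fusion more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the single-head attention with identity projections, the token values, and the choice to read the prompt out of a `[CLS]` token are all assumptions made for illustration. The only point it shows is that when image and text tokens share one attention sequence, the resulting prompt vector depends on both modalities.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def early_fusion_prompt(image_tokens, text_tokens, cls_token):
    """Early fusion: image and text tokens attend to each other in ONE
    sequence; the fused [CLS] token is read out as the prompt vector.
    (Using [CLS] as the readout is an assumption of this sketch.)"""
    sequence = [cls_token] + image_tokens + text_tokens
    fused = self_attention(sequence)
    return fused[0]


# Hypothetical 4-dimensional toy tokens.
cls = [0.1, 0.1, 0.1, 0.1]
text = [[0.5, 0.5, 0.0, 0.0]]                         # same expression both times
image_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_b = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

prompt_a = early_fusion_prompt(image_a, text, cls)
prompt_b = early_fusion_prompt(image_b, text, cls)

# Same text, different image: the prompt changes, because the text was
# fused with image content before the prompt was produced.
print(prompt_a != prompt_b)
```

In a text-only encoder, by contrast, the same expression would yield the same prompt for every image. In the pipeline described by the abstract, a vector like `prompt_a` would then be handed to SAM as the referring prompt; that hand-off is omitted here.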

arXiv information

Authors	Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Published	2024-06-28 17:38:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google