RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

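Abstract

The Segment Anything Model (SAM) has attracted considerable attention for its strong performance in image segmentation.
However, it performs poorly on referring video object segmentation (RVOS), because it requires precise user-interactive prompts and has only a limited understanding of different modalities such as language and vision.
This paper introduces RefSAM, the first model to explore SAM's potential for RVOS, which it does by incorporating multi-view information from diverse modalities and from successive frames at different timestamps.
The proposed approach adapts the original SAM to strengthen cross-modality learning: a lightweight Cross-Modal MLP projects the text embedding of the referring expression into sparse and dense embeddings, which serve as the user-interactive prompts.
A parameter-efficient tuning strategy then aligns and fuses the language and vision features.
Comprehensive ablation studies demonstrate that the design choices behind this strategy are practical and effective.
Extensive experiments on the Ref-YouTube-VOS and Ref-DAVIS17 datasets validate the superiority and effectiveness of RefSAM over existing methods.
The code and models will be made publicly available at https://github.com/LancasterLi/RefSAM.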

Abstract (original)

The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which for the first time explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Subsequently, a parameter-efficient tuning strategy is employed to effectively align and fuse the language and vision features. Through comprehensive ablation studies, we demonstrate the practical and effective design choices of our strategy. Extensive experiments conducted on Ref-YouTube-VOS and Ref-DAVIS17 datasets validate the superiority and effectiveness of our RefSAM model over existing methods. The code and models will be made publicly available at https://github.com/LancasterLi/RefSAM.
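
The abstract describes the Cross-Modal MLP only at a high level. As a rough illustration of the idea (not the authors' implementation), the sketch below projects text features into a set of per-word "sparse" prompt tokens and a single "dense" prompt vector. Everything here is an assumption made for illustration: the names `CrossModalMLP` and `text_to_prompts`, the two-layer ReLU design, the use of separate projections for the two outputs, the choice of word-level features for the sparse prompts and a mean-pooled sentence feature for the dense prompt, and the toy dimensions.

```python
import random


def init_linear(rng, in_dim, out_dim):
    """Create a small random weight matrix (list of rows) and a zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class CrossModalMLP:
    """Hypothetical two-layer MLP mapping a text feature to the prompt dimension."""

    def __init__(self, text_dim, hidden_dim, prompt_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(rng, text_dim, hidden_dim)
        self.fc2 = init_linear(rng, hidden_dim, prompt_dim)

    def __call__(self, x):
        return linear(relu(linear(x, *self.fc1)), *self.fc2)


def text_to_prompts(word_embeddings, sparse_mlp, dense_mlp):
    """Turn word-level text features into sparse and dense prompt embeddings.

    Assumption: sparse prompts come from individual word features and the
    dense prompt comes from a mean-pooled sentence feature.
    """
    # Sparse prompts: one projected token per word of the referring expression.
    sparse = [sparse_mlp(word) for word in word_embeddings]
    # Dense prompt: pool the words into one sentence vector, then project it.
    # A real model would broadcast this vector over the image-embedding grid.
    count = len(word_embeddings)
    sentence = [sum(column) / count for column in zip(*word_embeddings)]
    dense = dense_mlp(sentence)
    return sparse, dense


# Toy usage: 5 words with 8-dimensional features, projected to 4 dimensions.
rng = random.Random(1)
words = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
sparse, dense = text_to_prompts(
    words,
    CrossModalMLP(text_dim=8, hidden_dim=16, prompt_dim=4, seed=0),
    CrossModalMLP(text_dim=8, hidden_dim=16, prompt_dim=4, seed=1),
)
print(len(sparse), len(sparse[0]), len(dense))  # prints: 5 4 4
```

In this sketch the projected embeddings would take the place of the point, box, or mask prompts that a user would otherwise supply to SAM's mask decoder; how RefSAM actually wires them in is specified in the paper and its code release, not here.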

arXiv information

Authors	Yonglin Li, Jing Zhang, Xiao Teng, Long Lan
Published	2023-07-03 13:21:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
