GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation

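Summary

Referring video object segmentation (RVOS) aims to segment the object referred to by a query sentence throughout an entire video.
Most existing methods require end-to-end training with dense mask annotations, which can be computationally expensive and limits scalability.
In this work, we aim to efficiently adapt foundation segmentation models to RVOS under weak supervision, using the proposed Grounded Prompting (GroPrompt) framework.
More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL), which strengthens the association between position prompts and referring sentences using only box supervision. TAP-CL comprises Text-Contrastive Prompt Learning (TextCon) at the frame level and Modality-Contrastive Prompt Learning (ModalCon) at the video level.
With TAP-CL, the GroPrompt framework can generate temporally consistent yet text-aware position prompts that describe the location and movement of the referred object in the video.
Experimental results on standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) show that the GroPrompt framework achieves competitive performance given only weak bounding-box supervision.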

Abstract (original)

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we aim to efficiently adapt foundation segmentation models for addressing RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervisions, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporal-consistent yet text-aware position prompts describing locations and movements for the referred object from the video. The experimental results in the standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of our proposed GroPrompt framework given only bounding box weak supervisions.
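
The abstract does not spell out the TAP-CL loss, so the following is only a generic illustration of the kind of contrastive objective that pairs prompt embeddings with sentence embeddings. It is a minimal InfoNCE sketch in plain Python, not the authors' TextCon or ModalCon formulation. The function names `l2_normalize` and `info_nce`, the temperature of 0.07, and the toy one-hot embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(prompt_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss where prompt_embs[i] is the positive for text_embs[i].

    Hypothetical helper: every other sentence in the batch acts as a negative.
    """
    prompts = [l2_normalize(p) for p in prompt_embs]
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for i, p in enumerate(prompts):
        # Temperature-scaled cosine similarity between prompt i and every sentence.
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in texts]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matching sentence.
        total += log_denom - logits[i]
    return total / len(prompts)


# Toy data (assumed): four orthogonal one-hot embeddings.
eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
matched = info_nce(eye, eye)            # each prompt paired with its own sentence
mismatched = info_nce(eye[::-1], eye)   # prompts paired with the wrong sentences
print(matched < mismatched)  # True
```

In this toy setting the loss is close to zero when each prompt is paired with its own sentence and large when the pairing is reversed, which is the behavior a contrastive objective relies on to tie position prompts to the referring text.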

arXiv information

Authors	Ci-Siang Lin, I-Jieh Liu, Min-Hung Chen, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang
Published	2024-06-18 17:54:17+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
