Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

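Summary

This paper proposes an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore a training-free paradigm for audio- and language-referenced video object segmentation, namely the AVS and RVOS tasks.
The intuitive solution uses GroundingDINO to identify the target object in a single frame and SAM 2 to segment that object throughout the video. Because it does not explore video context, this approach is less robust to spatiotemporal variations.
The AL-Ref-SAM 2 pipeline therefore introduces a novel GPT-assisted Pivot Selection (GPT-PS) module, which instructs GPT-4 to perform two-step temporal-spatial reasoning that selects a pivot frame and then a pivot box.
This provides SAM 2 with a high-quality initial object prompt.
Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding it to make its selections based on a comprehensive understanding of the video and the reference information.
In addition, the paper proposes a Language-Binded Reference Unification (LBRU) module that converts audio signals into language-formatted references, unifying the formats of the AVS and RVOS tasks within the same pipeline.
Extensive experiments on both tasks show that the training-free AL-Ref-SAM 2 pipeline achieves performance comparable to, or even better than, fully supervised fine-tuning methods.
The code is available at https://github.com/appletea233/AL-Ref-SAM2.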

Summary (Original)

In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT’s temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
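
The two-step pivot selection described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: `select_pivot`, `PivotPrompt`, and the three callables are hypothetical names introduced here for illustration, and the toy stand-ins below replace GPT-4 and GroundingDINO with trivial rules so that the example runs without any model. The sketch only shows the control flow of choosing a pivot frame first and a pivot box second, producing the kind of initial box prompt a video segmenter such as SAM 2 would then propagate.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


@dataclass
class PivotPrompt:
    """Initial object prompt: one frame index and one box in that frame."""
    frame_index: int
    box: Box


def select_pivot(
    frames: Sequence[object],
    reference: str,
    choose_frame: Callable[[Sequence[object], str], int],
    propose_boxes: Callable[[object, str], List[Box]],
    choose_box: Callable[[object, List[Box], str], int],
) -> PivotPrompt:
    """Two-step selection: pick a pivot frame, then a pivot box inside it.

    All three callables are placeholders (assumptions of this sketch); in a
    real system they would wrap a reasoning model and an object detector.
    """
    if not frames:
        raise ValueError("frames must not be empty")

    # Step 1 (temporal): choose the frame that best shows the referred object.
    frame_index = choose_frame(frames, reference)
    if not 0 <= frame_index < len(frames):
        raise ValueError("choose_frame returned an out-of-range index")

    # Step 2 (spatial): choose one candidate box within the pivot frame.
    candidates = propose_boxes(frames[frame_index], reference)
    if not candidates:
        raise ValueError("no candidate boxes for the pivot frame")
    box_index = choose_box(frames[frame_index], candidates, reference)
    if not 0 <= box_index < len(candidates):
        raise ValueError("choose_box returned an out-of-range index")

    return PivotPrompt(frame_index, candidates[box_index])


# Toy stand-ins so the sketch runs; frames are plain caption strings here.
def toy_choose_frame(frames: Sequence[str], reference: str) -> int:
    # First frame whose caption mentions the reference, else frame 0.
    for i, caption in enumerate(frames):
        if reference in caption:
            return i
    return 0


def toy_propose_boxes(frame: str, reference: str) -> List[Box]:
    # Fixed candidates; a detector would produce these in practice.
    return [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 50.0, 40.0)]


def toy_choose_box(frame: str, boxes: List[Box], reference: str) -> int:
    # Pick the largest box by area.
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    return areas.index(max(areas))


if __name__ == "__main__":
    frames = ["empty street", "a dog runs", "a dog and a cat"]
    print(select_pivot(frames, "dog", toy_choose_frame,
                       toy_propose_boxes, toy_choose_box))
```

Passing the selection steps in as callables keeps the control flow independent of any particular model API, so each step can be swapped for a real model wrapper without changing `select_pivot`.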

arXiv information

Authors	Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu
Published	2024-08-28 15:47:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
