SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

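Summary

Referring Video Object Segmentation (RVOS) segments an object in a video clip based on a natural language expression.
Existing methods either restrict reasoning to independent short clips, which loses global context, or process the entire video offline, which prevents their use in a streaming setting.
This work aims to overcome these limitations by designing an RVOS method that operates effectively in streaming-like scenarios while retaining contextual information from past frames.
The method builds on the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited to streaming processing.
SAM2 is made "wiser" by equipping it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights and without outsourcing modality interaction to external models.
To this end, a novel adapter module injects temporal information and multi-modal cues into the feature extraction process.
The authors further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module that adjusts the tracking focus when the current frame features suggest a new object better aligned with the caption.
The proposed method, SAMWISE, achieves state-of-the-art results across various benchmarks while adding a negligible overhead of fewer than 5 M parameters.
Code is available at https://github.com/ClaudiaCuttano/SAMWISE.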

Abstract (original)

Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, that provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser, by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights, and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues in the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, SAMWISE, achieves state-of-the-art across various benchmarks, by adding a negligible overhead of less than 5 M parameters. Code is available at https://github.com/ClaudiaCuttano/SAMWISE .
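The adapter idea in the abstract (a small trainable module that feeds text cues into the features of a frozen backbone) can be illustrated with a minimal sketch. This is not the authors' module: the class name `CrossModalAdapter`, the bottleneck layout, the zero-initialized up-projection, and all dimensions are assumptions chosen for illustration, and the temporal part of the adapter is left out entirely. The sketch uses pure Python lists so it runs without any dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CrossModalAdapter:
    """Hypothetical bottleneck adapter: visual tokens attend to text tokens.

    Only the adapter holds parameters here; the backbone producing the
    visual features is assumed to stay frozen.
    """

    def __init__(self, dim, text_dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.down_v = rand_matrix(dim, bottleneck, rng)       # visual down-projection
        self.down_t = rand_matrix(text_dim, bottleneck, rng)  # text down-projection
        # Assumption: zero-initialized up-projection, so the adapter
        # initially leaves the backbone features unchanged.
        self.up = [[0.0] * dim for _ in range(bottleneck)]
        self.scale = 1.0 / math.sqrt(bottleneck)

    def num_params(self):
        """Count the adapter's parameters (three projection matrices)."""
        return sum(len(m) * len(m[0]) for m in (self.down_v, self.down_t, self.up))

    def __call__(self, visual, text):
        q = matmul(visual, self.down_v)          # (N, r) queries from visual tokens
        k = matmul(text, self.down_t)            # (T, r) keys/values from text tokens
        scores = matmul(q, list(zip(*k)))        # (N, T) similarity q @ k^T
        attn = [softmax([s * self.scale for s in row]) for row in scores]
        fused = matmul(attn, k)                  # (N, r) text-aware features
        delta = matmul(fused, self.up)           # (N, dim) projected back up
        # Residual connection: backbone features plus the adapter's update.
        return [[v + d for v, d in zip(vr, dr)] for vr, dr in zip(visual, delta)]


rng = random.Random(1)
visual = rand_matrix(5, 8, rng, scale=1.0)  # 5 visual tokens, 8 channels
text = rand_matrix(3, 6, rng, scale=1.0)    # 3 text tokens, 6 channels
adapter = CrossModalAdapter(dim=8, text_dim=6, bottleneck=4)
out = adapter(visual, text)
```

With these toy sizes the adapter has 8*4 + 6*4 + 4*8 = 88 parameters, and because the up-projection starts at zero, `out` equals `visual` until the adapter is trained. The bottleneck is what keeps the added parameter count small relative to the backbone.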

arXiv information

Authors	Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Published	2025-03-25 17:17:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
