Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

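Summary

This work introduces Sa2VA, the first unified model for dense grounded understanding of both images and videos.
Existing multi-modal large language models are often limited to specific modalities and tasks. Sa2VA instead supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning.
Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space.
The LLM generates instruction tokens that guide SAM-2 in producing precise masks, which enables grounded multi-modal understanding of both static and dynamic visual content.
The authors also introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance.
In addition, they manually validate 2k video objects in Ref-SAV to benchmark referring video object segmentation in complex environments.
Experiments show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly referring video object segmentation, highlighting its potential for complex real-world applications.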
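
The instruction-token idea in the summary can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the 2-dimensional features, and the dot-product "decoder" are assumptions made for illustration, standing in for the LLM hidden state and SAM-2's far more elaborate mask decoder. The only point is that a single embedding produced on the language side can prompt mask prediction on every frame.

```python
# Toy illustration only. None of these names come from Sa2VA, SAM-2, or LLaVA.

def project(hidden, weights):
    """Linear projection from a (hypothetical) LLM hidden state to a prompt embedding."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def decode_mask(prompt, frame_features, threshold=0.0):
    """Mark a pixel as foreground when its feature vector aligns with the prompt."""
    return [
        [1 if sum(p * f for p, f in zip(prompt, pixel)) > threshold else 0
         for pixel in row]
        for row in frame_features
    ]

def segment_video(instruction_hidden, weights, video_features):
    """Reuse one prompt embedding for every frame of the clip."""
    prompt = project(instruction_hidden, weights)
    return [decode_mask(prompt, frame) for frame in video_features]

# Assumed 2-d hidden state of the instruction token and an identity projection.
instruction_hidden = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0]]

# Two frames of 1x2 "pixels"; the matching feature moves from left to right.
video_features = [
    [[[1.0, 0.0], [0.0, 1.0]]],
    [[[0.0, 1.0], [1.0, 0.0]]],
]

masks = segment_video(instruction_hidden, weights, video_features)
print(masks)
```

In this sketch the same prompt selects the left pixel in the first frame and the right pixel in the second, which mirrors, in a very reduced form, how one language-derived prompt can follow an object through a video.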

Abstract (original)

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.

arXiv information

Authors	Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
Published	2025-01-07 18:58:54+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
