Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Summary

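This work presents Sa2VA, the first unified model for dense, grounded understanding of both images and videos.
Existing multi-modal large language models are often restricted to particular modalities or tasks. Sa2VA, by contrast, supports a broad range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning.
Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, images, and video in a shared LLM token space.
The LLM generates instruction tokens that guide SAM-2 in producing precise masks, which enables grounded, multi-modal understanding of both static and dynamic visual content.
The authors also introduce Ref-SAV, an auto-labeled dataset of more than 72k object expressions in complex video scenes, designed to boost model performance.
In addition, they manually validate 2k video objects from Ref-SAV to form a benchmark for referring video object segmentation in complex environments.
Experiments show that Sa2VA achieves state-of-the-art results on multiple tasks, particularly referring video object segmentation, which highlights its potential for complex real-world applications.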

Abstract (original)

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.

arXiv information

Authors	Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
Published	2025-02-13 18:14:33+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
