InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

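Abstract

Driven by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have advanced rapidly in recent years.
However, these methods are often developed separately for each domain, which overlooks the similarities in task settings and solutions across the two.
This paper defines the union of referring segmentation and reasoning segmentation, at both the image and video levels, as Instructed Visual Segmentation (IVS).
Correspondingly, it proposes InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS.
Specifically, an object-aware video perceiver extracts temporal and object information from reference frames to support comprehensive video understanding.
In addition, vision-guided multi-granularity text fusion is introduced to better integrate global and detailed text information with fine-grained visual guidance.
By leveraging multi-task and end-to-end training, InstructSeg performs strongly across a variety of image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model.
The code is available at https://github.com/congvvc/InstructSeg.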

Abstract (original)

Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model. Our code is available at https://github.com/congvvc/InstructSeg.

arXiv information

Authors	Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng Zhao, Yujiu Yang
Published	2024-12-18 16:20:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
