LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

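Summary

Visual instruction tuning has substantially advanced the capabilities of Large Multimodal Models (LMMs).
However, existing open LMMs focus mainly on single-image tasks, and their application to multi-image scenarios remains less explored.
In addition, prior LMM research has tackled different scenarios separately, which makes it impossible to generalize across scenarios with newly emerging capabilities.
To address this, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios in LMMs.
To enable these capabilities, we treat the interleaved data format as a general template and compile the M4-Instruct dataset, which contains 1,177.6k samples spanning 4 primary domains with 14 tasks and 41 datasets.
We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs.
Extensive experiments show that LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks while maintaining performance on single-image tasks.
Our model also exhibits several emerging capabilities, such as transferring tasks across different settings and modalities.
Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT.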
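
The idea of using an interleaved data format as a general template can be illustrated with a small sketch. This is not the authors' implementation or the actual M4-Instruct schema: the `InterleavedSample` class, the `build_sample` helper, the `IMAGE_TOKEN` placeholder string, and the file names are all hypothetical, chosen only to show how multi-image, video, and 3D inputs might share one representation (text with image placeholders plus an ordered list of images).

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical placeholder marking where an image sits in the text.
IMAGE_TOKEN = "<image>"


@dataclass
class InterleavedSample:
    """Text with image placeholders, plus the images in placeholder order."""
    text: str
    images: List[str] = field(default_factory=list)  # illustrative file paths

    def validate(self) -> None:
        # Every placeholder in the text must be backed by exactly one image.
        n_slots = self.text.count(IMAGE_TOKEN)
        if n_slots != len(self.images):
            raise ValueError(
                f"{n_slots} placeholders but {len(self.images)} images"
            )


def build_sample(images: List[str], prompt: str) -> InterleavedSample:
    """Prefix the prompt with one placeholder per image (an assumed layout)."""
    text = IMAGE_TOKEN * len(images) + "\n" + prompt
    return InterleavedSample(text=text, images=list(images))


# The same structure covers several scenarios (file names are made up):
multi_image = build_sample(["before.jpg", "after.jpg"], "What changed?")
video = build_sample([f"frame_{i}.jpg" for i in range(4)], "Describe the clip.")
multi_view = build_sample(["front.jpg", "side.jpg", "top.jpg"], "Where is the chair?")

for sample in (multi_image, video, multi_view):
    sample.validate()
```

In this sketch a video is simply a sequence of frames and a 3D scene a set of views, so one loader and one validation step serve every scenario; how the real dataset orders placeholders relative to text is not stated in the abstract.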

Summary (original)

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT

arXiv information

Authors	Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Published	2024-07-10 17:59:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
