MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

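Abstract

Multi-modal large language models (MLLMs) have made significant progress on a wide range of visual understanding tasks.
However, most of these models are limited to processing low-resolution images, which reduces their effectiveness on perception tasks that require detailed visual information.
In this study, we present MG-LLaVA, an MLLM that strengthens the model's visual processing capabilities by incorporating a multi-granularity vision flow comprising low-resolution, high-resolution, and object-centric features.
We propose integrating an additional high-resolution visual encoder to capture fine-grained details, which are then fused with the base visual features through a Conv-Gate fusion network.
To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors.
Trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates strong perception skills.
We instantiate MG-LLaVA with a variety of language encoders, ranging from 3.8B to 34B parameters, to evaluate the model's performance comprehensively.
Extensive evaluations across multiple benchmarks show that MG-LLaVA outperforms existing MLLMs of comparable parameter size.
The code is available at https://github.com/PhoenixZ810/MG-LLaVA.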

Abstract (original)

Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model’s visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model’s object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model’s performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
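The abstract names two mechanisms without giving their exact form: gated fusion of high-resolution features into the base features, and object-level features taken from detector bounding boxes. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. The function names, the 1x1 gate, the residual form `low + gate * high`, and average pooling over box cells are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(low, high, w_low, w_high, bias):
    """Fuse two feature maps of identical shape [C][H][W] (nested lists).

    Assumed form: a per-position gate computed by a 1x1 convolution over
    the concatenated channels, then fused = low + gate * high.
    """
    channels, height, width = len(low), len(low[0]), len(low[0][0])
    fused = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            # 1x1 convolution over the concatenated channels gives one logit.
            logit = bias
            for c in range(channels):
                logit += w_low[c] * low[c][y][x] + w_high[c] * high[c][y][x]
            gate = sigmoid(logit)
            for c in range(channels):
                fused[c][y][x] = low[c][y][x] + gate * high[c][y][x]
    return fused


def pool_box(feature, box):
    """Average-pool a [C][H][W] feature map inside one bounding box.

    box = (x0, y0, x1, y1) in feature-map cells, upper bounds exclusive.
    Returns one value per channel (a hypothetical object-level feature).
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    return [
        sum(ch[y][x] for y in range(y0, y1) for x in range(x0, x1)) / area
        for ch in feature
    ]


# Toy inputs: one channel, 2x2 spatial grid.
low = [[[1.0, 2.0], [3.0, 4.0]]]
high = [[[10.0, 10.0], [10.0, 10.0]]]

# Zero weights and zero bias make the gate exactly 0.5 at every position.
fused = gated_fuse(low, high, w_low=[0.0], w_high=[0.0], bias=0.0)

# Object-level feature for a box covering the top row of the grid.
top_row_feature = pool_box(fused, (0, 0, 2, 1))
```

With the toy inputs the gate is 0.5 everywhere, so each fused value is the base value plus half of the high-resolution value. In a trained model the gate weights would be learned, and the boxes would come from an offline detector as the abstract describes.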

arXiv information

Authors	Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
Published	2024-06-25 17:55:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
