mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

arXiv information

Authors	Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Published	2024-08-13 08:10:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
