u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Abstract

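Recent progress in multi-modal large language models (MLLMs) has substantially improved visual understanding, driven mainly by sophisticated modality alignment strategies.
However, mainstream approaches prioritize global or regional understanding and pay less attention to fine-grained, pixel-level tasks.
To address this gap, we introduce u-LLaVA, a unified multi-task framework that integrates pixel-level, regional, and global features to improve the perceptual abilities of MLLMs.
First, we use an efficient modality alignment approach that draws on both image and video datasets to strengthen the model's foundational understanding across diverse visual contexts.
Next, we present a joint instruction tuning method that uses task-specific projectors and decoders for end-to-end downstream training.
In addition, this work contributes a new mask-based multi-task dataset of 277K samples, built to challenge and evaluate the fine-grained perception capabilities of MLLMs.
The overall framework is simple and effective, and it achieves state-of-the-art performance across multiple benchmarks.
The model, data, and code are publicly available at https://github.com/OPPOMKLab/u-LLaVA.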

Abstract (original)

Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize global or regional comprehension, with less focus on fine-grained, pixel-level tasks. To address this gap, we introduce u-LLaVA, an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. We commence by leveraging an efficient modality alignment approach, harnessing both image and video datasets to bolster the model’s foundational understanding across diverse visual contexts. Subsequently, a joint instruction tuning method with task-specific projectors and decoders for end-to-end downstream training is presented. Furthermore, this work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also make our model, data, and code publicly accessible at https://github.com/OPPOMKLab/u-LLaVA.
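
The abstract mentions task-specific projectors and decoders but gives no implementation detail. The following is only a minimal sketch of one way such routing could work, not the authors' method: it assumes that special task tokens mark positions in the language model output, and that the hidden state at each such position is passed through a linear projector registered for that task before reaching a decoder. The token names `<SEG>` and `<LOC>`, the function names, and the toy dimensions are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]  # shape: (out_dim, in_dim)


def project(matrix: Matrix, vec: Vector) -> Vector:
    """Apply a linear projector (matrix-vector product, no bias)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def route_task_tokens(
    tokens: List[str],
    hidden_states: List[Vector],
    projectors: Dict[str, Matrix],
) -> Dict[str, List[Vector]]:
    """Project the hidden state at each task-token position with the
    projector registered for that token (hypothetical routing scheme)."""
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have equal length")
    routed: Dict[str, List[Vector]] = {name: [] for name in projectors}
    for token, state in zip(tokens, hidden_states):
        if token in projectors:
            routed[token].append(project(projectors[token], state))
    return routed


# Toy data: 3-dimensional hidden states and two hypothetical task tokens.
tokens = ["the", "cat", "<SEG>", "at", "<LOC>"]
hidden_states = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0],
    [4.0, 5.0, 6.0],
]
projectors = {
    "<SEG>": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # keeps the first two dims
    "<LOC>": [[0.0, 0.0, 1.0]],                   # keeps the last dim
}
routed = route_task_tokens(tokens, hidden_states, projectors)
```

In a real system the projected vectors would be handed to task decoders (for example a mask decoder for segmentation), and the projectors would be learned jointly with the rest of the model during instruction tuning; the fixed matrices here only make the data flow visible.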

arXiv information

Authors	Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Fanyi Wang, Yanchun Xie, Yi-Jie Huang, Yaqian Li
Published	2024-08-28 14:26:07+00:00

Source and services used

arxiv.jp, Google
