MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

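Abstract

Multimodal large language models (MLLMs) have advanced considerably at understanding complex human intent through cross-modal interaction, yet capturing intricate image details remains difficult.
Earlier methods that integrate multiple vision encoders to enrich visual detail introduce redundancy and computational overhead.
We observe that most MLLMs use only the last-layer feature map of the vision encoder as the visual representation, ignoring the rich, fine-grained information in shallow feature maps.
To address this, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs).
Specifically, it uses semantically aligned deep features as queries to dynamically extract missing details from shallow features, enriching the representation with fine-grained information while preserving semantic alignment.
Applied to the LLaVA-1.5 model, MMFuser yields significant improvements in visual representation and benchmark performance, and offers a more flexible, lightweight alternative to multi-encoder ensemble methods.
The code and models are available at https://github.com/yuecao0119/MMFuser.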

Abstract (original)

Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
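
The idea of using deep features as queries over shallow features can be illustrated with a small sketch. This is not the released implementation: the abstract does not state which attention variant, how many layers, or which projections are used, so the code below assumes plain scaled dot-product cross-attention with a residual connection, and every name in it (`fuse_features`, `w_q`, `w_k`, `w_v`, the token count and feature width) is hypothetical. Random arrays stand in for ViT feature maps.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_features(deep, shallow_layers, w_q, w_k, w_v):
    """Illustrative deep-queries-shallow fusion (an assumption, not the paper's code).

    deep:           (N, D) last-layer token features, used as queries.
    shallow_layers: list of (N, D) token features from earlier layers.
    w_q, w_k, w_v:  (D, D) hypothetical projection matrices.
    Returns the fused (N, D) features and the (N, L*N) attention weights.
    """
    # Stack tokens from all shallow layers into one key/value pool.
    shallow = np.concatenate(shallow_layers, axis=0)  # (L*N, D)

    q = deep @ w_q
    k = shallow @ w_k
    v = shallow @ w_v

    # Each deep token attends over every shallow token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)

    # Residual connection: the deep features stay the backbone of the
    # representation; shallow detail is added on top.
    details = attn @ v
    return deep + details, attn


# Stand-in data: 16 tokens, 32 channels, 3 shallow layers (all assumed values).
rng = np.random.default_rng(0)
num_tokens, dim = 16, 32
deep = rng.standard_normal((num_tokens, dim))
shallow_layers = [rng.standard_normal((num_tokens, dim)) for _ in range(3)]
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, attn = fuse_features(deep, shallow_layers, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch the output keeps the same token count and width as the deep features, so it could replace the last-layer feature map wherever that map was consumed. The residual form is one simple way to reflect the abstract's claim that semantic alignment is preserved while detail is added.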

arXiv information

Authors	Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu
Published	2024-10-15 17:55:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
