Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

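Abstract

Large vision-language models (LVLMs) have achieved significant success on multimodal tasks by combining pre-trained vision encoders with large language models.
However, current LVLMs rely mainly on features from the final layers of the vision encoder and ignore the complementary information in shallower layers.
Recent methods have explored multi-layer features, but they are often task-agnostic.
We investigate how visual features from different encoder layers contribute across 18 benchmarks and 6 task categories.
Our results show that multi-layer features offer complementary strengths whose usefulness depends on the task, and that uniform fusion performs suboptimally.
Based on these findings, we propose an instruction-guided vision aggregator that dynamically integrates multi-layer features according to the textual instruction, without increasing the number of visual tokens.
Extensive evaluations show superior performance, and the analysis reveals that mid-to-high-level features dominate in semantic tasks while low-level features play a critical role in fine-grained perception.
This work provides valuable insights into the adaptive use of hierarchical visual features in LVLMs and advances more flexible multimodal systems.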

Abstract (original)

Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks by combining pre-trained vision encoders and large language models. However, current LVLMs mainly rely on features from the final layers of the vision encoder, neglecting complementary information in shallower layers. While recent methods have explored multi-layer features, they are often task-agnostic. We investigate the contributions of visual features from different encoder layers across 18 benchmarks and 6 task categories. Our results show that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion performs suboptimally. Based on these findings, we propose an instruction-guided vision aggregator that dynamically integrates multi-layer features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations show superior performance, and analysis reveals the dominance of mid-to-high-level features in semantic tasks and the critical role of low-level features in fine-grained perception. This work provides valuable insights into the adaptive use of hierarchical visual features in LVLMs, advancing more flexible multimodal systems.
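The abstract does not spell out how the aggregator computes its weights, so the following is only a minimal sketch of the general idea: score each group of encoder layers against an instruction embedding, normalize the scores with a softmax, and take a weighted sum of the per-layer features. The dot-product scoring, the `layer_keys` vectors, and all function names are assumptions made for illustration, not the authors' implementation. Because the layers are summed rather than concatenated, the fused output has the same number of visual tokens as any single layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weights(instruction_embedding, layer_keys):
    """Score each layer group by its dot product with the instruction embedding.

    Assumption: one key vector per layer group (hypothetical; a real model
    would learn these or use a different scoring function).
    """
    scores = [
        sum(q * k for q, k in zip(instruction_embedding, key))
        for key in layer_keys
    ]
    return softmax(scores)


def fuse_layers(layer_features, weights):
    """Weighted sum over layers; token count and feature size are unchanged."""
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for feats, w in zip(layer_features, weights):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * feats[t][d]
    return fused


# Toy data: three layer groups (low, mid, high), two visual tokens, two dims.
layer_features = [
    [[1.0, 0.0], [0.0, 1.0]],  # low-level features
    [[0.5, 0.5], [0.5, 0.5]],  # mid-level features
    [[0.0, 1.0], [1.0, 0.0]],  # high-level features
]
layer_keys = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]  # hypothetical keys
instruction_embedding = [1.0, 0.0]  # stands in for an encoded instruction

weights = layer_weights(instruction_embedding, layer_keys)
fused = fuse_layers(layer_features, weights)
```

In this toy setup the instruction embedding aligns with the low-level key, so the low-level features receive the largest weight. Setting all weights equal would correspond to the uniform fusion that the abstract reports as suboptimal.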

arXiv information

Authors	Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue
Published	2025-01-16 12:06:35+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
