LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

Summary

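Enhanced visual understanding serves as a cornerstone of multimodal large language models (MLLMs).
Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual token sequences.
Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders.
This work examines visual-token fusion strategies for hybrid MLLMs, leading to the design of LEO, a new MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling.
For each segmented tile of the input image, LEO sequentially interleaves the visual tokens from its two vision encoders.
Extensive evaluation across 13 vision-language benchmarks shows that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks.
Furthermore, the authors show that LEO can be adapted to the specialized domain of autonomous driving without changing the model architecture or training recipe, achieving competitive performance compared to existing baselines.
The code and model will be made publicly available.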

Summary (original)

Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.
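
The abstract says that, for each tile, LEO sequentially interleaves the visual tokens from its two vision encoders. The following is a minimal sketch of one possible reading of that sentence, in which the fused sequence places encoder A's tokens for a tile directly before encoder B's tokens for the same tile. The function name `interleave_tile_tokens`, the string tokens standing in for embedding vectors, and the tile-level granularity are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List, Sequence

Token = str  # hypothetical stand-in for a projected visual embedding


def interleave_tile_tokens(
    tokens_a: Sequence[Sequence[Token]],
    tokens_b: Sequence[Sequence[Token]],
) -> List[Token]:
    """Fuse per-tile token lists from two encoders, tile by tile.

    tokens_a[i] and tokens_b[i] hold the tokens that encoder A and
    encoder B produced for tile i (assumed already projected into a
    shared embedding space, i.e. fused after adaptation).
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both encoders must cover the same number of tiles")

    fused: List[Token] = []
    for tile_a, tile_b in zip(tokens_a, tokens_b):
        fused.extend(tile_a)  # encoder A's tokens for this tile
        fused.extend(tile_b)  # then encoder B's tokens for the same tile
    return fused


# Toy input: two tiles, two tokens per tile from each encoder.
enc_a = [["a0_0", "a0_1"], ["a1_0", "a1_1"]]
enc_b = [["b0_0", "b0_1"], ["b1_0", "b1_1"]]
fused_tokens = interleave_tile_tokens(enc_a, enc_b)
print(fused_tokens)
```

Under this reading, tokens that describe the same image region stay adjacent in the sequence passed to the language model. A token-by-token interleave within each tile would be an equally plausible interpretation of the abstract alone.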

arXiv information

Authors	Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki
Published	2025-01-13 00:29:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
