Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

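Summary

The ability to accurately interpret complex visual information is a key topic for multimodal large language models (MLLMs). Recent work shows that strengthening visual perception substantially reduces hallucinations and improves performance on resolution-sensitive tasks such as optical character recognition and document analysis. Many recent MLLMs achieve this with a mixture of vision encoders. Despite these successes, systematic comparisons and detailed ablation studies covering important aspects such as expert selection and the integration of multiple vision experts are lacking. This study provides an extensive exploration of the design space of MLLMs that use a mixture of vision encoders and resolutions. Our findings reveal several basic principles shared by various existing strategies and lead to a streamlined yet effective design approach. We find that simply concatenating the visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. In addition, we introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, strengthening model coherence. The resulting MLLM family, Eagle, surpasses other leading open-source models on major MLLM benchmarks.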

Summary (original)

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
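The claim that plain concatenation of visual tokens is competitive can be pictured with a minimal sketch. It is not the authors' implementation: the abstract does not say along which axis tokens are concatenated or how encoders with different output sizes are aligned, so the sketch assumes feature-wise (channel) concatenation of token grids that already have equal length. The names `concat_visual_tokens`, `encoder_a`, and `encoder_b` are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Tokens = List[List[float]]  # one feature vector per visual token


def concat_visual_tokens(encoder_outputs: List[Tokens]) -> Tokens:
    """Fuse per-encoder tokens by concatenating their features token by token.

    Assumption: every encoder already emits the same number of tokens
    (for example after resizing its feature map). That alignment step
    is not shown here.
    """
    if not encoder_outputs:
        raise ValueError("need at least one encoder output")
    num_tokens = len(encoder_outputs[0])
    for tokens in encoder_outputs:
        if len(tokens) != num_tokens:
            raise ValueError("all encoders must emit the same number of tokens")

    fused: Tokens = []
    for i in range(num_tokens):
        merged: List[float] = []
        for tokens in encoder_outputs:
            # Append this encoder's features for token i to the fused vector.
            merged.extend(tokens[i])
        fused.append(merged)
    return fused


# Toy outputs from two hypothetical encoders: 2 tokens each, widths 2 and 3.
encoder_a = [[1.0, 2.0], [3.0, 4.0]]
encoder_b = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

fused = concat_visual_tokens([encoder_a, encoder_b])
print(len(fused), len(fused[0]))  # 2 tokens, each of width 2 + 3 = 5
```

The token count stays fixed while the feature width grows with each added encoder. In a full model the fused tokens would typically pass through a projection layer before reaching the language model; that step is outside what the abstract describes and is omitted here.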

arXiv information

Authors	Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
Published	2025-03-02 23:41:37+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
