Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Summary

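The ability to accurately interpret complex visual information is a key topic for multimodal large language models (MLLMs).
Recent work shows that stronger visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks such as optical character recognition and document analysis.
Many recent MLLMs achieve this goal by using a mixture of vision encoders.
Despite their success, systematic comparisons and detailed ablation studies that address critical aspects, such as expert selection and the integration of multiple vision experts, are lacking.
This study provides an extensive exploration of the design space for MLLMs that use a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
We find that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies.
We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
Models and code: https://github.com/NVlabs/Eagle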
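
The concatenation idea in the summary can be illustrated with a toy sketch. The abstract does not say along which axis the tokens are joined, so this example assumes channel-wise concatenation with every encoder producing the same number of tokens; the function name `concat_visual_tokens` and the two toy encoders are hypothetical and are not taken from the Eagle codebase.

```python
from typing import List

# One row per visual token, one column per feature channel.
Tokens = List[List[float]]


def concat_visual_tokens(encoder_outputs: List[Tokens]) -> Tokens:
    """Join per-token features from several encoders along the channel axis.

    Assumption (not stated in the abstract): all encoders yield the same
    number of tokens, so token i of each encoder describes the same region.
    """
    if not encoder_outputs:
        raise ValueError("need at least one encoder output")
    num_tokens = len(encoder_outputs[0])
    for tokens in encoder_outputs:
        if len(tokens) != num_tokens:
            raise ValueError("all encoders must yield the same number of tokens")
    fused: Tokens = []
    for i in range(num_tokens):
        row: List[float] = []
        for tokens in encoder_outputs:
            row.extend(tokens[i])  # append this encoder's channels for token i
        fused.append(row)
    return fused


# Toy outputs from two hypothetical encoders: 3 tokens each, 2 and 4 channels.
encoder_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
encoder_b = [[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3]]

fused = concat_visual_tokens([encoder_a, encoder_b])
print(len(fused), len(fused[0]))  # 3 tokens, 6 channels each
```

In a real model the fused tokens would typically pass through a learned projection before reaching the language model; that step is omitted here because the summary gives no details about it.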

Summary (original)

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle

arXiv information

Authors	Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
Published	2024-08-28 17:59:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
