Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Summary

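InternVL 2.5 is an advanced multimodal large language model (MLLM) series built on InternVL 2.0; it keeps the core model architecture while introducing significant enhancements to training and testing strategies and to data quality.
This work examines the relationship between model scaling and performance, systematically exploring performance trends across vision encoders, language models, dataset sizes, and test-time configurations.
In extensive evaluations on a wide range of benchmarks, covering multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 shows competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
Notably, the model is the first open-source MLLM to surpass 70% on the MMMU benchmark, gaining 3.7 points through Chain-of-Thought (CoT) reasoning and showing strong potential for test-time scaling.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
For the HuggingFace demo, see https://huggingface.co/spaces/OpenGVLab/InternVL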

Summary (original)

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. For the HuggingFace demo, see https://huggingface.co/spaces/OpenGVLab/InternVL

arXiv information

Authors	Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Published	2024-12-06 18:57:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
