InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

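Summary

We introduce InternVL3, a significant advance in the InternVL series built on a native multimodal pre-training paradigm.
Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 acquires multimodal and linguistic capabilities jointly, from both diverse multimodal data and pure-text corpora, in a single pre-training stage.
This unified training paradigm effectively addresses the complexity and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs.
To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure.
Extensive empirical evaluations show that InternVL3 delivers superior performance across a wide range of multimodal tasks.
In particular, InternVL3-78B scores 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs.
Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while it also maintains strong pure-language proficiency.
In pursuit of open-science principles, the authors will publicly release both the training data and the model weights to foster further research and development in next-generation MLLMs.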

Abstract (original)

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
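The abstract names variable visual position encoding (V2PE) without describing how it works. The sketch below is only an illustration under one assumption: that visual tokens advance the position counter by a smaller step than text tokens, so long image sequences consume less of the positional range. The function name `position_ids`, the `"text"`/`"image"` labels, and the step of 1/4 are hypothetical and are not taken from the InternVL3 code.

```python
from fractions import Fraction


def position_ids(token_kinds, visual_step=Fraction(1, 4)):
    """Assign a position to each token (illustrative sketch, not InternVL3 code).

    Assumption: a text token advances the position by 1, while a visual
    token advances it by the smaller `visual_step`. The first token sits at 0.
    """
    positions = []
    current = None
    for kind in token_kinds:
        step = Fraction(1) if kind == "text" else visual_step
        # The first token starts at 0; every later token adds its own step.
        current = Fraction(0) if current is None else current + step
        positions.append(current)
    return positions


# One text token, four image tokens, then one more text token.
tokens = ["text", "image", "image", "image", "image", "text"]
ids = position_ids(tokens)
print([str(p) for p in ids])  # ['0', '1/4', '1/2', '3/4', '1', '2']
```

With unit steps for every token, the last token would sit at position 5; under the assumed fractional step it sits at position 2, which is the sense in which such a scheme could stretch the usable multimodal context.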

arXiv information

Authors	Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Published	2025-04-14 17:59:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
