Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

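Summary

This paper focuses on monolithic multimodal large language models (MLLMs), which integrate visual encoding and language decoding into a single LLM.
In particular, we find that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting.
To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, so that visual knowledge can be learned stably from noisy data while the LLM itself stays frozen.
Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
We further propose an innovative pre-training strategy, Endogenous Visual Pre-training (EViP), to maximize the visual capability of Mono-InternVL.
EViP is designed as a progressive learning process for the visual experts, which aims to fully exploit visual knowledge across the range from noisy data to high-quality data.
To validate our approach, we conducted extensive experiments on 16 benchmarks.
The results show that Mono-InternVL outperforms existing monolithic MLLMs on 13 of the 16 multimodal benchmarks (e.g., +80 points over Emu3 on OCRBench).
Compared with the modular baseline, InternVL-1.5, Mono-InternVL retains comparable multimodal performance while reducing first-token latency by up to 67%.
The code and model are released at https://huggingface.co/OpenGVLab/Mono-InternVL-2B.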

Summary (original)

In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while freezing the LLM. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm the superior performance of Mono-InternVL than existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL still retains comparable multimodal performance while reducing up to 67% first token latency. Code and model are released at https://huggingface.co/OpenGVLab/Mono-InternVL-2B.
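
The abstract describes visual experts added alongside a frozen pre-trained LLM through a multimodal mixture-of-experts structure. The following is a minimal, illustrative sketch of that idea only, not the authors' implementation. It assumes static routing by token modality, and every name in it (Expert, Token, ModalityRoutedFFN, apply_update) is hypothetical. The toy experts are single ReLU layers standing in for real feed-forward blocks.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[Vector]


@dataclass
class Expert:
    """Toy feed-forward expert computing relu(W x); a stand-in for a real FFN."""
    weights: Matrix          # shape: out_dim x in_dim
    trainable: bool = True

    def forward(self, x: Vector) -> Vector:
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.weights]


@dataclass
class Token:
    embedding: Vector
    is_visual: bool          # True for image tokens, False for text tokens


class ModalityRoutedFFN:
    """Sends each token to one expert by modality (assumed static routing)."""

    def __init__(self, text_expert: Expert, visual_expert: Expert) -> None:
        text_expert.trainable = False   # assumption: pre-trained LLM weights stay frozen
        self.text_expert = text_expert
        self.visual_expert = visual_expert

    def forward(self, tokens: List[Token]) -> List[Vector]:
        return [
            (self.visual_expert if t.is_visual else self.text_expert).forward(t.embedding)
            for t in tokens
        ]


def apply_update(expert: Expert, delta: Matrix) -> bool:
    """Add delta to the expert's weights unless it is frozen; report whether it changed."""
    if not expert.trainable:
        return False
    for row, delta_row in zip(expert.weights, delta):
        for j, d in enumerate(delta_row):
            row[j] += d
    return True


# Illustrative weights only; real experts would be initialized from the LLM.
text = Expert(weights=[[1.0, 0.0], [0.0, 1.0]])
visual = Expert(weights=[[2.0, 0.0], [0.0, 2.0]])
layer = ModalityRoutedFFN(text, visual)

tokens = [Token([1.0, -1.0], is_visual=False), Token([1.0, -1.0], is_visual=True)]
outputs = layer.forward(tokens)
```

In this sketch only the visual expert accepts weight updates, which mirrors the abstract's point that visual knowledge is learned in a new parameter space while the language weights are left untouched.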

arXiv information

Authors	Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu
Published	2024-11-20 12:15:08+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
