MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

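Summary

MM1.5 is a new family of multimodal large language models (MLLMs) designed to strengthen text-rich image understanding, visual referring and grounding, and multi-image reasoning.
Built on the MM1 architecture, MM1.5 takes a data-centric approach to model training and systematically explores how diverse data mixtures affect the entire training lifecycle.
This covers high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning.
The models range from 1B to 30B parameters and include both dense and mixture-of-experts (MoE) variants, showing that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B).
In addition, the authors introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding.
Through extensive empirical studies and ablations, the paper gives detailed insight into the training processes and decisions behind the final designs, offering guidance for future research in MLLM development.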

Abstract (original)

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

arXiv information

Authors	Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
Published	2024-09-30 17:59:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
