MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

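Summary

Open-source multimodal large language models (MLLMs) have shown significant potential across a broad range of multimodal tasks.
However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which are mostly repurposed from academic datasets such as VQA, AI2D, and ChartQA.
These datasets target simple tasks and provide only phrase-level answers, with no intermediate rationales.
To address these challenges, we introduce a scalable, cost-effective method for constructing a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning.
Using only open models, we create a dataset of 12 million instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales.
Experiments show that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%).
In addition, the model shows notable improvements of up to 4% on non-reasoning-based benchmarks.
Ablation studies further highlight the importance of key components of the dataset construction process, such as rewriting and self-filtering.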

Abstract (original)

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

arXiv information

Authors	Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
Published	2024-12-06 18:14:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
