Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Abstract

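Large language models (LLMs) have demonstrated strong reasoning capabilities, particularly in solving textual mathematical problems.
However, existing open-source image instruction fine-tuning datasets contain only a limited number of question-answer pairs per image, so they do not fully exploit visual information to strengthen the multimodal mathematical reasoning of multimodal LLMs (MLLMs).
To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions.
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned on MathV360K.
This approach significantly improves the multimodal mathematical reasoning of LLaVA-1.5, yielding a 19-point increase on the minitest split of MathVista and performance comparable to GPT-4V.
Furthermore, Math-LLaVA shows improved generalizability, with substantial gains on the MMMU benchmark.
Our research highlights the importance of dataset diversity and synthesis in advancing the mathematical reasoning abilities of MLLMs.
The code and data are available at \url{https://github.com/HZQ950419/Math-LLaVA}.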

Abstract (original)

Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista’s minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs’ mathematical reasoning abilities. The code and data are available at: \url{https://github.com/HZQ950419/Math-LLaVA}.

arXiv information

Authors	Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee
Published	2024-06-26 16:43:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
