Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Summary

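Vision-language models (VLMs) have recently made remarkable progress on multimodal tasks, and multimodal instruction data is the foundation for strengthening their capabilities.
Although several open-source multimodal datasets are available, the limited scale and quality of open-source instruction data hold back the VLMs trained on them, leaving a large gap to models trained on closed-source data.
To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset.
We collected the available multimodal instruction datasets and applied unified preprocessing, which yielded a dataset of over 40 million samples that ensures diversity and accuracy.
In addition, to allow instruction data to be expanded at scale and to support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs.
By establishing correspondences between different types of images and the associated instruction types, this method can provide essential guidance during data synthesis.
Using this high-quality data, we trained Aquila-VL-2B, a 2-billion-parameter vision-language model that achieves state-of-the-art (SOTA) performance among models of similar scale.
The data is available at https://huggingface.co/datasets/BAAI/Infinity-MM.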

Abstract (original)

Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
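
The abstract says the synthesis method establishes correspondences between image types and instruction types, but it does not say how. The following is a minimal sketch of one way such a correspondence could guide synthesis, assuming the correspondence is a simple co-occurrence table built from seed data. The function names, tag names, and instruction type names are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def build_tag_to_instruction_types(seed_samples):
    """Count how often each instruction type co-occurs with each image tag.

    seed_samples is an iterable of (image_tags, instruction_type) pairs.
    This co-occurrence table is an assumed stand-in for the paper's
    image-type/instruction-type correspondence.
    """
    table = defaultdict(Counter)
    for tags, instruction_type in seed_samples:
        for tag in tags:
            table[tag][instruction_type] += 1
    return table


def suggest_instruction_types(table, image_tags, top_k=2):
    """Rank instruction types for a new image by summed co-occurrence counts."""
    scores = Counter()
    for tag in image_tags:
        scores.update(table.get(tag, Counter()))
    return [name for name, _ in scores.most_common(top_k)]


# Hypothetical seed data: (image tags, instruction type) pairs.
seed = [
    (["chart", "text"], "data_analysis"),
    (["chart"], "data_analysis"),
    (["text", "document"], "ocr"),
    (["dog", "outdoor"], "description"),
    (["document"], "ocr"),
]

table = build_tag_to_instruction_types(seed)
suggestions = suggest_instruction_types(table, ["chart", "text"])
print(suggestions)
```

In a pipeline of this shape, the suggested instruction types would then be passed to an open-source VLM as guidance for generating questions and answers about the tagged image; that generation step is omitted here because the abstract gives no details about it.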

arXiv information

Authors	Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Songjing Wang, Yulong Ao, Yiming Ju, Huanhuan Ma, Xiaotong Li, Haiwen Diao, Yufeng Cui, Xinlong Wang, Yaoqi Liu, Fangxiang Feng, Guang Liu
Published	2025-01-06 12:48:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
