ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Summary

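With the rise of multimodal applications, instruction data has become critical for training multimodal language models that can understand complex image-based queries.
Existing methods rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to generate instruction data.
These models are prone to hallucinations and licensing issues, and the generation process is often hard to scale and interpret.
In this work, we present a programmatic approach that uses scene graphs as symbolic representations of images, together with human-written programs, to systematically synthesize vision-centric instruction data.
Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy.
By implementing a suite of 24 single-image instruction generators, 14 multi-image instruction generators, and a scene graph generation pipeline, we build ProVision, a scalable and cost-effective system that produces diverse question-answer pairs about objects, attributes, relations, depth, etc., for any given image.
Applied to the Visual Genome and DataComp datasets, it generates over 10 million instruction data points, ProVision-10M, which we leverage in both the pretraining and instruction tuning stages of MLMs.
When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU.
Our multi-image instruction data leads to an 8% improvement on Mantis-Eval.
Incorporating our data in both the pre-training and fine-tuning stages of xGen-MM-4B leads to an average improvement of 1.6% across 11 benchmarks.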
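
To make the idea of generating instruction data from scene graphs with hand-written programs more concrete, here is a minimal Python sketch. It is not the ProVision implementation: the scene-graph layout (an `objects` dictionary with `name`, `attributes`, and `depth` fields, plus a `relations` list of triples), the generator function names, and the question templates are all assumptions made for illustration.

```python
from itertools import combinations

# Illustrative sketch only. The scene-graph layout and the generator names
# below are assumptions, not the actual ProVision code or data format.

def attribute_questions(scene_graph):
    """Yield one yes/no question per annotated (object, attribute) pair."""
    for obj in scene_graph["objects"].values():
        for attr in obj.get("attributes", []):
            yield {"question": f"Is the {obj['name']} {attr}?", "answer": "Yes"}

def relation_questions(scene_graph):
    """Yield one question per (subject, predicate, object) relation triple."""
    for subj_id, predicate, obj_id in scene_graph["relations"]:
        subj = scene_graph["objects"][subj_id]["name"]
        obj = scene_graph["objects"][obj_id]["name"]
        yield {"question": f"What is the {subj} {predicate}?",
               "answer": f"The {obj}"}

def depth_questions(scene_graph):
    """Yield a 'which is closer' question for each pair with distinct depths."""
    with_depth = [o for o in scene_graph["objects"].values() if "depth" in o]
    for a, b in combinations(with_depth, 2):
        if a["depth"] == b["depth"]:
            continue  # skip ties: the answer would be ambiguous
        closer = a if a["depth"] < b["depth"] else b
        yield {"question": (f"Which is closer to the camera, "
                            f"the {a['name']} or the {b['name']}?"),
               "answer": f"The {closer['name']}"}

def generate(scene_graph):
    """Run every generator over one scene graph and collect the QA pairs."""
    pairs = []
    for generator in (attribute_questions, relation_questions, depth_questions):
        pairs.extend(generator(scene_graph))
    return pairs

# A toy scene graph; the depth values are hypothetical distances to the camera.
scene_graph = {
    "objects": {
        "o1": {"name": "man", "attributes": ["tall"], "depth": 2.0},
        "o2": {"name": "horse", "attributes": ["brown"], "depth": 2.5},
    },
    "relations": [("o1", "riding", "o2")],
}

qa_pairs = generate(scene_graph)
for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

Because every answer is read directly from the scene graph rather than produced by a language model, each pair in this sketch can be traced back to a specific annotation, which is the kind of interpretability and control the summary refers to.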

Summary (original)

With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. These are often prone to hallucinations, licensing issues and the generation process is often hard to scale and interpret. In this work, we present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy. By implementing a suite of 24 single-image, 14 multi-image instruction generators, and a scene graph generation pipeline, we build a scalable, cost-effective system: ProVision which produces diverse question-answer pairs concerning objects, attributes, relations, depth, etc., for any given image. Applied to Visual Genome and DataComp datasets, we generate over 10 million instruction data points, ProVision-10M, and leverage them in both pretraining and instruction tuning stages of MLMs. When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval. Incorporation of our data in both pre-training and fine-tuning stages of xGen-MM-4B leads to an averaged improvement of 1.6% across 11 benchmarks.

arXiv information

Authors	Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu
Published	2024-12-11 18:28:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
