Unicorn: Text-Only Data Synthesis for Vision Language Model Training

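Summary

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly.
Text data, by contrast, is abundant and inexpensive, which raises the question: can high-quality multimodal training data be synthesized purely from text?
To address this, the authors propose a cross-integrated three-stage multimodal data synthesis framework that generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction.
In Stage 1 (Diverse Caption Data Synthesis), 1.2M semantically diverse, high-quality captions are constructed by expanding sparse caption seeds with large language models (LLMs).
In Stage 2 (Instruction-Tuning Data Generation), 471K of these captions are further processed into multi-turn instruction-tuning tasks to support complex reasoning.
Finally, in Stage 3 (Modality Representation Transfer), the textual caption representations are transformed into visual representations, yielding diverse synthetic image representations.
This three-stage process makes it possible to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction tuning without relying on real images.
By eliminating the dependency on real images while maintaining data quality and diversity, the framework offers a cost-effective and scalable solution for VLM training.
Code is available at https://github.com/Yu-xm/Unicorn.git.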
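To make Stage 3 more concrete, the following is a minimal sketch of what a training-free transfer from text representations to synthetic image representations could look like. The abstract does not state the actual method, so everything here is an assumption for illustration: the caption embeddings are assumed to come from some shared text-image embedding space, the transfer is assumed to be a simple mean-centering followed by L2 renormalization, and the names `mean_vector`, `l2_normalize`, and `transfer_to_visual` are hypothetical and not taken from the Unicorn code base.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(v: Vector) -> Vector:
    """Scale v to unit L2 norm; a zero vector is returned unchanged."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def transfer_to_visual(text_embs: List[Vector]) -> List[Vector]:
    """ASSUMED transfer rule (not confirmed by the abstract):
    subtract the mean caption embedding, then renormalize each vector.
    """
    mu = mean_vector(text_embs)
    return [
        l2_normalize([e - m for e, m in zip(emb, mu)])
        for emb in text_embs
    ]


# Toy caption embeddings standing in for real text-encoder outputs.
toy_text_embs = [[1.0, 0.0], [0.0, 1.0]]
synthetic_image_embs = transfer_to_visual(toy_text_embs)
```

In this toy setup the resulting vectors are unit-length and centered around the origin. A real pipeline would replace `toy_text_embs` with encoder outputs for the synthesized captions and may use a different transfer rule entirely.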

Abstract (original)

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual captions representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLMs training. Code is available at https://github.com/Yu-xm/Unicorn.git.

arXiv information

Authors	Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang
Published	2025-03-28 17:43:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
