SVIT: Scaling up Visual Instruction Tuning

Summary

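With the emergence of foundation models, large language models and vision models have been integrated to acquire multimodal abilities such as visual captioning and question answering. Existing multimodal models show impressive performance in visual understanding and reasoning, but their limits remain largely under-explored because high-quality instruction tuning data is scarce.
To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning data, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Beyond its volume, the proposed dataset is characterized by high quality and rich diversity, as it is generated by prompting GPT-4 with abundant manual annotations of images.
We also propose a new data recipe for selecting a subset with better diversity and balance, which evokes the model's superior capabilities.
Extensive experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art multimodal large language models on popular benchmarks.
The data and code are publicly available at https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.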

Abstract (original)

Thanks to the emerging of foundation models, the large language and vision models are integrated to acquire the multimodal ability of visual captioning, question answering, etc. Although existing multimodal models present impressive performance of visual understanding and reasoning, their limits are still largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions. Besides the volume, the proposed dataset is also featured by the high quality and rich diversity, which is generated by prompting GPT-4 with the abundant manual annotations of images. We also propose a new data recipe to select subset with better diversity and balance, which evokes model’s superior capabilities. Extensive experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks. The data and code are publicly available at https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.

arXiv information

Authors	Bo Zhao, Boya Wu, Muyang He, Tiejun Huang
Published	2023-12-28 16:01:36+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
