M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

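Summary

Instruction tuning has substantially advanced large language models (LLMs) such as ChatGPT, allowing them to follow human instructions across a wide range of tasks.
Progress on open vision-language models (VLMs), however, has been held back by a shortage of high-quality instruction datasets.
To address this and support research in the vision-language field, the authors introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to improve how well VLMs align with human instructions.
M$^3$IT comprises 40 carefully curated datasets, with 2.4 million instances and 400 manually written task instructions, all reformatted into a vision-to-text structure.
Key tasks are translated into 80 languages with an advanced translation system to make the dataset more widely accessible.
M$^3$IT surpasses previous datasets in task coverage, number of instructions, and number of instances.
The authors also develop Ying-VLM, a VLM trained on M$^3$IT, and show its potential to answer complex questions that require world knowledge, to generalize to unseen video tasks, and to understand unseen instructions in Chinese.
The dataset has been open-sourced to encourage further research.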

Abstract (original)

Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited due to the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions. Our M$^3$IT dataset comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M$^3$IT surpasses previous datasets regarding task coverage, instruction number and instance scale. Moreover, we develop Ying-VLM, a VLM model trained on our M$^3$IT dataset, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. We have open-sourced the dataset to encourage further research.
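As a rough illustration of what a "vision-to-text structure" could look like, the sketch below pairs a raw sample with one instruction drawn from a task's instruction pool. The abstract does not specify a schema, so the class name, the field names (`instruction`, `image_path`, `inputs`, `outputs`), the helper functions, and the sample data are all hypothetical and are not taken from the released dataset.

```python
import random
from dataclasses import dataclass


@dataclass
class VisionToTextInstance:
    """One hypothetical training instance: image plus text in, text out."""
    instruction: str  # manually written task instruction
    image_path: str   # reference to the visual input (assumed field)
    inputs: str       # optional task-specific text, e.g. a question
    outputs: str      # target text the model should generate


def build_instance(instructions, image_path, inputs, outputs, rng):
    """Pair a raw sample with one instruction drawn from the task's pool."""
    return VisionToTextInstance(rng.choice(instructions), image_path, inputs, outputs)


def to_prompt(instance):
    """Flatten the text side of an instance into a single prompt string."""
    parts = [instance.instruction]
    if instance.inputs:
        parts.append(instance.inputs)
    return "\n".join(parts)


# Made-up instruction pool and sample for a visual question answering task.
vqa_instructions = [
    "Answer the question based on the image.",
    "Look at the picture and respond to the question.",
]
example = build_instance(
    vqa_instructions, "images/0001.jpg", "What color is the bus?", "red",
    random.Random(0),
)
print(to_prompt(example))
```

Drawing the instruction at random from a per-task pool is one plausible way to use several instructions per dataset (400 instructions over 40 datasets averages 10 each); the abstract does not state how instructions are assigned to instances.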

arXiv information

Authors	Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu
Published	2023-06-08 13:44:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
