Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Summary

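Large pre-trained multimodal models have proven highly successful across a range of downstream tasks, including image captioning, image-text retrieval, and visual question answering (VQA). However, many of these methods rely on image-text pairs collected from the web as pre-training data.
As a result, they overlook the need for fine-grained feature alignment between the vision and language modalities, which requires a detailed understanding of both images and language expressions.
Integrating VQA and dense captioning (DC) into pre-training can address this issue, but acquiring image-question-answer and image-location-caption triplets is difficult and time-consuming.
Moreover, publicly available datasets for VQA and dense captioning are typically limited in scale because they depend on manual data collection and labeling.
This paper proposes a new method called Joint QA and DC GEneration (JADE), which uses a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
The method is applied to the Conceptual Caption (CC3M) dataset to generate a new dataset called CC3M-QA-DC.
Experiments show that using CC3M-QA-DC for multi-task pre-training improves performance on a variety of downstream tasks across a variety of backbones.
Furthermore, the generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) to achieve results competitive with models that use far more data.
The code and dataset will be released.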

Abstract (original)

Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, visual question answering (VQA), etc. However, many of these methods rely on image-text pairs collected from the web as pre-training data and unfortunately overlook the need for fine-grained feature alignment between vision and language modalities, which requires detailed understanding of images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer as well as image-location-caption triplets is challenging and time-consuming. Additionally, publicly available datasets for VQA and dense captioning are typically limited in scale due to manual data collection and labeling efforts. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily-crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Caption (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that when used for pre-training in a multi-task manner, CC3M-QA-DC can improve the performance with various backbones on various downstream tasks. Furthermore, our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) and achieve competitive results compared with models using much more data. Code and dataset will be released.

arXiv information

Authors	Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu
Published	2023-05-19 15:54:40+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
