Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Summary

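Large language models have demonstrated an emergent ability to answer knowledge-intensive questions.
Given recent progress in web-scale vision-and-language pre-training, can these models also answer visual information-seeking questions?
To find out, we present InfoSeek, a visual question answering dataset focused on information-seeking questions that cannot be answered with common-sense knowledge alone.
We perform multi-stage human annotation to collect a natural distribution of high-quality visual information-seeking question-answer pairs.
We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets with Wikidata, which provides over one million examples for model fine-tuning and validation.
Using InfoSeek, we analyze a range of pre-trained Visual QA systems to gain insight into the characteristics of different pre-trained models.
Our analysis shows that state-of-the-art multimodal pre-trained models struggle to answer visual information-seeking questions, but that this capability improves after fine-tuning on the automated InfoSeek dataset.
We hope this analysis paves the way for understanding and developing the next generation of multimodal pre-training.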

Summary (original)

Large language models have demonstrated an emergent capability in answering knowledge intensive questions. With recent progress on web-scale visual and language pre-training, do these models also understand how to answer visual information seeking questions? To answer this question, we present InfoSeek, a Visual Question Answering dataset that focuses on asking information-seeking questions, where the information can not be answered by common sense knowledge. We perform a multi-stage human annotation to collect a natural distribution of high-quality visual information seeking question-answer pairs. We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets and Wikidata, which provides over one million examples for model fine-tuning and validation. Based on InfoSeek, we analyzed various pre-trained Visual QA systems to gain insights into the characteristics of different pre-trained models. Our analysis shows that it is challenging for the state-of-the-art multi-modal pre-trained models to answer visual information seeking questions, but this capability is improved through fine-tuning on the automated InfoSeek dataset. We hope our analysis paves the way to understand and develop the next generation of multi-modal pre-training.
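
As a rough illustration of the automatic construction step mentioned above (combining visual entity recognition data with Wikidata), the sketch below joins entity-labeled images with knowledge-base triples and fills question templates. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `LabeledImage` and `Triple` types, the `TEMPLATES` mapping, and the `make_qa_pairs` function are all hypothetical names introduced here, and the abstract does not say how questions were actually generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledImage:
    """An image annotated with the entity it depicts (hypothetical schema)."""
    image_id: str
    entity: str    # entity name, e.g. from an entity recognition dataset
    category: str  # coarse type used to phrase the question


@dataclass(frozen=True)
class Triple:
    """A (subject, relation, object) fact in the style of a knowledge base."""
    subject: str
    relation: str
    obj: str


# Hypothetical question templates keyed by relation name (an assumption).
# The entity is never named, so answering requires recognizing it visually.
TEMPLATES = {
    "country": "Which country is this {category} located in?",
    "inception": "When was this {category} founded?",
}


def make_qa_pairs(images, triples):
    """Join images and triples on the entity name and fill the templates."""
    facts_by_subject = {}
    for triple in triples:
        facts_by_subject.setdefault(triple.subject, []).append(triple)

    pairs = []
    for image in images:
        for triple in facts_by_subject.get(image.entity, []):
            template = TEMPLATES.get(triple.relation)
            if template is None:
                continue  # skip relations without a template
            pairs.append({
                "image_id": image.image_id,
                "question": template.format(category=image.category),
                "answer": triple.obj,
            })
    return pairs


if __name__ == "__main__":
    images = [LabeledImage("img_001", "Golden Gate Bridge", "bridge")]
    triples = [Triple("Golden Gate Bridge", "country", "United States")]
    for pair in make_qa_pairs(images, triples):
        print(pair)
```

Keeping the entity name out of the question is the design choice that matters here: the answer then depends on both recognizing what is in the image and retrieving a fact about it, which is the kind of question the abstract describes.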

arXiv information

Authors	Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, Ming-Wei Chang
Published	2023-02-23 00:33:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
