Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Summary

Title: Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Summary:
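– Vision-and-language (V&L) models pretrained on large-scale multimodal data perform strongly on tasks such as image captioning and visual question answering (VQA).
– The quality of these models is usually assessed by measuring their performance on unseen data drawn from the same distribution as the training data.
– However, when evaluated on VQA in out-of-distribution (out-of-dataset) settings, these models generalize poorly.
– This study evaluates two pretrained V&L models through cross-dataset evaluations in different settings (classification and open-ended text generation). The models tend to learn to solve the benchmark rather than the high-level skills the VQA task requires.
– In most cases, generative models are less susceptible to shifts in data distribution than discriminative ones, and multimodal pretraining is generally helpful for OOD generalization.
– The study also revisits the assumptions underlying automatic VQA evaluation metrics and shows empirically that their stringent nature repeatedly penalizes models for correct responses.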

Summary (original)

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution compared to discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
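To make the last point concrete, here is a minimal sketch of how a string-matching metric can penalize a correct answer. The abstract does not name a specific metric; this assumes a simplified form of the widely used VQA accuracy, min(1, matches / 3), where a prediction counts as a match only if it is string-identical to a human answer. The function name `vqa_accuracy` and the example answers are hypothetical.

```python
from typing import Sequence


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: min(1, matches / 3).

    Assumption: matching is exact string equality after lowercasing and
    trimming whitespace. Real evaluation scripts apply more normalization.
    """
    norm = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == norm)
    return min(1.0, matches / 3.0)


# Hypothetical annotations for the question "What is the cat sitting on?"
human_answers = ["couch"] * 8 + ["sofa"] * 2

print(vqa_accuracy("couch", human_answers))    # full credit: 8 annotators agree
print(vqa_accuracy("sofa", human_answers))     # partial credit: only 2 annotators agree
print(vqa_accuracy("a couch", human_answers))  # no credit, although the answer is correct
```

Under these assumptions, "a couch" scores zero because it matches no annotator string exactly, which is the kind of penalty on a correct response that the abstract describes. A model that generates free-form text is more exposed to this than a classifier restricted to a fixed answer vocabulary.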

arXiv information

Authors	Aishwarya Agrawal, Ivana Kajić, Emanuele Bugliarello, Elnaz Davoodi, Anita Gergely, Phil Blunsom, Aida Nematzadeh
Published	2023-04-01 07:07:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
