IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

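Summary

The advent of Vision Language Models (VLMs) has allowed researchers to probe a neural network's visual understanding using natural language.
Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning.
This naturally raises the question: how do VLMs respond when the image itself is inherently unreasonable?
To this end, the authors present IllusionVQA, a diverse dataset of challenging optical illusions and hard-to-interpret scenes that tests VLMs on two distinct multiple-choice VQA tasks: comprehension and soft localization.
GPT4V, the best-performing VLM, achieves 62.99% accuracy on the comprehension task (4-shot) and 49.7% on the localization task (4-shot with Chain-of-Thought).
Human evaluation shows that humans reach 91.03% accuracy on comprehension and 100% on localization.
The authors find that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task.
Tangentially, they identify a potential weakness in the ICL capabilities of VLMs: the models fail to locate optical illusions even when the correct answer is present in the context window as a few-shot example.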

Summary (original)

The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks – comprehension and soft localization. GPT4V, the best performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro in the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.

arXiv information

Authors	Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar
Published	2024-08-09 14:26:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
