A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering

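Summary

The emergence of multimodal large models (MLMs) has significantly advanced visual understanding and brought remarkable capabilities to visual question answering (VQA).
However, the real challenge lies in knowledge-intensive VQA tasks, which require not only recognizing visual elements but also deeply understanding visual information in combination with a vast repository of learned knowledge.
To uncover these capabilities in MLMs, particularly the newly introduced GPT-4V, we provide an in-depth evaluation from three perspectives: 1) Commonsense knowledge, which assesses how well the model understands visual cues and connects them to general knowledge.
2) Fine-grained world knowledge, which tests the model's skill at inferring specific knowledge from images and shows its proficiency across various specialized fields.
3) Comprehensive knowledge with decision-making rationales, which examines the model's ability to give logical explanations for its inferences and enables deeper analysis from an interpretability perspective.
Extensive experiments indicate that GPT-4V achieves SOTA performance on the three tasks above.
Interestingly, a) GPT-4V shows enhanced reasoning and explanation when composite images are used as few-shot examples.
b) GPT-4V produces severe hallucinations when dealing with world knowledge, highlighting the need for future advances in this research direction.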

Abstract (original)

The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA). Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate not just recognition of visual elements, but also a deep comprehension of the visual information in conjunction with a vast repository of learned knowledge. To uncover such capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect to general knowledge; 2) Fine-grained World Knowledge, which tests the model’s skill in reasoning out specific knowledge from images, showcasing their proficiency across various specialized fields; 3) Comprehensive Knowledge with Decision-making Rationales, which examines model’s capability to provide logical explanations for its inference, facilitating a deeper analysis from the interpretability perspective. Extensive experiments indicate that GPT-4V achieves SOTA performance on above three tasks. Interestingly, we find that: a) GPT-4V demonstrates enhanced reasoning and explanation when using composite images as few-shot; b) GPT-4V produces severe hallucinations when dealing with world knowledge, highlighting the future need for advancements in this research direction.

arXiv information

Authors	Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang Lyu, Min Zhang
Published	2023-11-13 18:22:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
