Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

Abstract

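Counterfactual reasoning, an important manifestation of human intelligence, means making presuppositions based on established facts and extrapolating the potential outcomes.
Existing multimodal large language models (MLLMs) have been examined across a wide range of Visual Question Answering (VQA) benchmarks and show impressive cognitive and reasoning capabilities.
How, then, do existing MLLMs behave when faced with counterfactual questions?
To answer this question, we first curate a new CounterFactual MultiModal reasoning benchmark (abbreviated CFMM) to systematically evaluate the counterfactual reasoning capabilities of MLLMs.
CFMM consists of six challenging tasks, each containing hundreds of carefully human-labeled counterfactual questions that evaluate MLLMs' counterfactual reasoning capabilities across diverse aspects.
Interestingly, our experiments show that existing MLLMs prefer to believe what they see and ignore the counterfactual presuppositions presented in the question, which leads to inaccurate responses.
Furthermore, we evaluate a wide range of prevalent MLLMs on the proposed CFMM.
The large gap between their performance on CFMM and on several VQA benchmarks indicates that existing MLLMs still have considerable room for improvement before approaching human-level intelligence.
At the same time, improving MLLM performance on CFMM in the future may open potential avenues toward developing MLLMs with advanced intelligence.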

Abstract (original)

Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel CounterFactual MultiModal reasoning benchmark, abbreviated as CFMM, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLM’s counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.

arXiv information

Authors	Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Yu-Gang Jiang
Published	2024-04-19 15:53:27+00:00

Source, services used

arxiv.jp, Google
