VAGUE: Visual Contexts Clarify Ambiguous Expressions

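Summary

Human communication often relies on visual cues to resolve ambiguity.
Humans integrate these cues intuitively, but AI systems often struggle with sophisticated multimodal reasoning.
We introduce VAGUE, a benchmark that evaluates the ability of multimodal AI systems to integrate visual context for intent disambiguation.
VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is apparent only with the visual context.
The dataset spans both staged, complex scenes (Visual Commonsense Reasoning) and natural, personal scenes (Ego4D), ensuring diversity.
Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent.
Performance improves consistently as more visual cues are introduced, but overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning.
Analysis of failure cases shows that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not reason with them effectively.
We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.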

Abstract (original)

Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems’ ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker’s true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.
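
The multiple-choice setup described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Item` fields, the `accuracy` helper, the toy utterances, and the file names are hypothetical and are not taken from the VAGUE repository or its data format.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical benchmark item: an ambiguous utterance plus candidate interpretations."""
    utterance: str
    image_path: str      # image that supplies the disambiguating context
    choices: list[str]   # multiple-choice interpretations
    answer: int          # index of the interpretation that is correct given the image


def accuracy(items, predict):
    """Return the fraction of items for which predict(item) equals the correct index."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Toy data, invented for illustration only.
items = [
    Item("It's getting hard to see in here.", "scene_001.jpg",
         ["Turn on the lamp", "Clean my glasses", "Open the window", "Leave the room"], 0),
    Item("Someone is going to trip over that.", "scene_002.jpg",
         ["Move the bag", "Tie your shoelaces", "Fix the rug", "Watch the step"], 2),
]


def always_first(item):
    """Stand-in 'model' that ignores the image and always picks the first choice."""
    return 0


print(accuracy(items, always_first))  # 0.5
```

In a real evaluation, `predict` would wrap a multimodal model call. Scoring the same items with and without the image is one way to measure how much the visual context contributes.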

arXiv information

Authors	Heejeong Nam, Jinwoo Ahn, Keummin Ka, Jiwan Chung, Youngjae Yu
Published	2025-03-11 13:29:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
