Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Abstract

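Recent advances in multimodal large language models have driven breakthroughs in visual question answering.
However, a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, which is a basic capability of human reasoning.
To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset of six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction.
VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form.
Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed completely on isomorphism detection and showed limited success on path/cycle tasks.
We further identify behavioral anomalies that suggest pseudo-intelligent pattern matching rather than genuine understanding.
These findings underscore fundamental limitations of current AI models in visual understanding.
By isolating the challenge of representation-invariant reasoning, VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
The Visual Graph Arena is available at https://vga.csail.mit.edu/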

Abstract (original)

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/

arXiv information

Authors	Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
Published	2025-06-06 17:06:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
