V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

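Abstract

Recent advances in multimodal large language models (MLLMs) have brought substantial improvements across a range of multimodal benchmarks.
However, as evaluation shifts from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate: they lack visual-centric tasks and cannot assess the diverse reasoning skills that real-world decision-making requires.
To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess the visual reasoning capabilities of MLLMs.
V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning such as long-term planning and deliberation.
We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
In every game environment, the top-performing MLLMs, as determined by Elo rating comparisons, show a substantial performance gap compared to humans.
Our findings highlight critical limitations, including the various types of perceptual errors the models make, and suggest potential avenues for improvement from an agent-centric perspective.
Code is available at https://github.com/CSU-JPG/V-MAGE.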

Abstract (original)

Recent advancements in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluations shift from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate because they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning like long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at https://github.com/CSU-JPG/V-MAGE.
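The abstract ranks models by Elo rating comparisons but does not say how the ratings are computed. As a rough illustration only, the sketch below shows the conventional pairwise Elo update. The K-factor of 32, the 400-point scale, the starting rating of 1500, and the function names are assumptions chosen for illustration, not details taken from V-MAGE.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k is the conventional K-factor (an assumption, not a V-MAGE setting).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two models start at 1500 and model A wins one level.
model_a, model_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(model_a, model_b)
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by half the K-factor in opposite directions.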

arXiv information

Authors	Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
Published	2025-04-08 15:43:01+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
