PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models

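Summary

Large multimodal models (LMMs) have demonstrated remarkable capabilities in interpreting and reasoning about visual scenes, but their capacity for complex, precise 3D spatial reasoning remains uncertain.
Existing benchmarks focus mainly on 2D spatial understanding and lack a framework for comprehensively evaluating 6D spatial reasoning across varying levels of complexity.
To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around four key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation.
We develop a cascading evaluation structure, constructing seven question types across five difficulty levels, ranging from basic single-object recognition to newly proposed complex 6D spatial reasoning tasks.
We evaluate a range of large multimodal models (LMMs) on PulseCheck457 and observe a general decline in performance as task complexity increases, particularly on 3D reasoning and 6D spatial tasks.
To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), which highlights key weaknesses in 3D reasoning capabilities.
Leveraging the dataset's unbiased attribute design, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.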

Abstract (original)

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
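The abstract names the Relative Performance Dropping Rate (RPDR) but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: the fraction of accuracy lost when moving from a simpler task to a more complex one. The function name `relative_drop`, this definition, and the accuracy values are all hypothetical and are not taken from the paper.

```python
def relative_drop(acc_simple: float, acc_complex: float) -> float:
    """Fraction of the simpler-task accuracy lost on the harder task.

    ASSUMPTION: this is one plausible reading of a "relative dropping
    rate"; the paper's actual RPDR definition may differ.
    """
    if acc_simple <= 0:
        raise ValueError("acc_simple must be positive")
    return (acc_simple - acc_complex) / acc_simple


# Hypothetical accuracies, invented purely for illustration
# (not results reported for any model).
accuracies = {"2D location": 0.80, "3D location": 0.60}

drop = relative_drop(accuracies["2D location"], accuracies["3D location"])
print(f"relative drop: {drop:.2f}")
```

A relative measure of this kind makes drops comparable across models with different baseline accuracy, which an absolute difference would not.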

arXiv information

Authors	Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille
Published	2025-02-12 18:53:20+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
