SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

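Summary

Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, but their capacity for spatial understanding remains less explored.
This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
Specifically, the paper makes the following contributions. (i) It introduces VGBench, a benchmark designed specifically to assess MLLMs on visual geometry perception, e.g., camera pose and motion estimation.
(ii) It proposes SpatialScore, described as the most comprehensive and diverse multimodal spatial understanding benchmark to date, which integrates VGBench with relevant data from 11 other existing datasets.
The benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard.
(iii) It develops SpatialAgent, a novel multi-agent system that incorporates 9 specialized tools for spatial understanding and supports both the Plan-Execute and ReAct reasoning paradigms.
(iv) It conducts extensive evaluations that reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent.
The authors believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.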

Abstract (original)

Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.

arXiv information

Authors	Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Published	2025-05-22 17:59:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
