OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

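Summary

Spatial reasoning is an important aspect of cognitive psychology and a major bottleneck for current vision-language models (VLMs). Extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, distinguishing near from far, and counting objects, but these tasks cover only the most basic level of spatial reasoning. This work introduces OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning grounded in cognitive psychology. OmniSpatial covers four major categories (dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking) and has 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, the authors construct more than 1.5K question-answer pairs. Extensive experiments show that both open-source and closed-source VLMs, as well as existing reasoning and spatial understanding models, have significant limitations in comprehensive spatial understanding. The authors also analyze failure cases and propose directions for future research.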

Abstract (original)

Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs’ understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.

arXiv information

Authors	Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Published	2025-06-03 17:58:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
