ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Abstract

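Vision-language models (VLMs) have shown remarkable ability to understand and reason about visual content, but tasks that require cross-viewpoint understanding and spatial reasoning remain a significant challenge.
We identify a critical limitation: current VLMs excel mainly at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints, where they must adopt another entity's spatial frame of reference.
We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically to evaluate multi-viewpoint spatial localization recognition across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels.
A comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models perform reasonably on camera-perspective tasks, but their accuracy drops when they reason from a human viewpoint.
By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach.
Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.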

Abstract (original)

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera’s perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity’s spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs’ corresponding spatial comprehension capabilities.
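
The distinction between egocentric and allocentric reasoning can be illustrated with a small sketch. This is not the paper's annotation pipeline; it is a hypothetical top-down 2D toy, and the function name `direction_label`, the coordinate convention, and the four-way label set are all assumptions made for illustration. It shows how the same object can receive different directional labels depending on whose frame of reference is used.

```python
from typing import Tuple

Vec2 = Tuple[float, float]


def direction_label(origin: Vec2, facing: Vec2, target: Vec2) -> str:
    """Label where `target` lies relative to `origin` in a frame looking along `facing`.

    Hypothetical convention: top-down 2D coordinates, where an observer
    facing (0, 1) has +y as "front" and +x as "right".
    """
    dx, dy = target[0] - origin[0], target[1] - origin[1]
    fx, fy = facing
    # Component of the offset along the facing direction.
    forward = dx * fx + dy * fy
    # The right-hand axis is the facing vector rotated 90 degrees clockwise: (fy, -fx).
    right = dx * fy - dy * fx
    if abs(forward) >= abs(right):
        return "front" if forward >= 0 else "back"
    return "right" if right > 0 else "left"


# A person stands at (0, 5) and faces the camera; a cup sits at (2, 5).
person, cup = (0, 5), (2, 5)
camera_facing = (0, 1)    # camera at the origin looks along +y
person_facing = (0, -1)   # person looks back toward the camera

# Egocentric: the cup relative to the person, judged along the camera's view direction.
print(direction_label(person, camera_facing, cup))   # right
# Allocentric: the same cup, judged from the person's own facing direction.
print(direction_label(person, person_facing, cup))   # left
```

The label flips from "right" to "left" because the person faces the opposite way to the camera, which is the kind of perspective shift the abstract describes as difficult for current VLMs.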

arXiv information

Authors	Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Published	2025-05-27 17:59:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
