SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence

Abstract

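Multimodal large language models (MLLMs) have made remarkable progress across a wide range of multimodal tasks.
To achieve higher spatial intelligence, MLLMs must integrate multiple atomic spatial capabilities to handle complex and dynamic tasks.
However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level.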
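To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluation.
In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities.
Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs.
With more than 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings such as point cloud input and multi-choice QA.
We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by a large margin.
Through careful study, we also draw several significant findings that benefit the MLLM community.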
For example, we reveal that weak counting capability greatly limits the compositional spatial capabilities of existing MLLMs.
The evaluation code and benchmark datasets are available at https://github.com/Cuzyoung/SpaCE-10.

Abstract (original)

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150+ hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. The evaluation code and benchmark datasets are available at https://github.com/Cuzyoung/SpaCE-10.

arXiv information

Authors	Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji
Published	2025-06-09 17:41:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
