Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics

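Abstract (translated)

The Theory of Multiple Intelligences emphasizes the hierarchical nature of cognitive abilities.
To advance Spatial Artificial Intelligence, we pioneer a psychometric framework that defines five Basic Spatial Abilities (BSAs) in Visual Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization.
Benchmarking 13 mainstream VLMs through nine validated psychometric experiments reveals a significant gap versus humans (average score 24.95 vs. 68.38) and three key findings: 1) VLMs mirror the human hierarchy (strongest in 2D orientation, weakest in 3D rotation), and the BSAs are largely independent of one another (Pearson's r < 0.4); 2) smaller models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading (30.82) and InternVL2 lagging (19.6); 3) interventions such as chain-of-thought (0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits that stem from architectural constraints.
The identified barriers include weak geometry encoding and missing dynamic simulation. By linking psychometric BSAs to VLM capabilities, we provide a diagnostic toolkit for spatial intelligence evaluation, methodological foundations for embodied AI development, and a cognitive science-informed roadmap for achieving human-like spatial intelligence.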

Abstract (original)

The Theory of Multiple Intelligences underscores the hierarchical nature of cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13 mainstream VLMs through nine validated psychometric experiments reveals significant gaps versus humans (average score 24.95 vs. 68.38), with three key findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation, weakest in 3D rotation) with independent BSAs (Pearson’s r<0.4); 2) Smaller models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading (30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought (0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from architectural constraints. Identified barriers include weak geometry encoding and missing dynamic simulation. By linking psychometric BSAs to VLM capabilities, we provide a diagnostic toolkit for spatial intelligence evaluation, methodological foundations for embodied AI development, and a cognitive science-informed roadmap for achieving human-like spatial intelligence.
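
The independence claim in the abstract rests on pairwise Pearson correlations between ability scores falling below 0.4. As a rough illustration of that kind of check, and not the authors' actual analysis code, the sketch below computes Pearson's r for every pair of abilities and flags pairs under the threshold. The function names, the use of the absolute value of r, and the toy numbers are all assumptions made for this example; the numbers are invented and are not results from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_independence(scores, threshold=0.4):
    """Map each ability pair to (r, is_weakly_correlated).

    `scores` maps an ability name to a list of per-model scores.
    Treating |r| < threshold as "independent" is an assumption of this sketch.
    """
    report = {}
    for a, b in combinations(sorted(scores), 2):
        r = pearson_r(scores[a], scores[b])
        report[(a, b)] = (r, abs(r) < threshold)
    return report


if __name__ == "__main__":
    # Invented toy scores for four hypothetical models; NOT data from the paper.
    toy_scores = {
        "mental_rotation": [10.0, 20.0, 30.0, 40.0],
        "spatial_perception": [35.0, 15.0, 15.0, 35.0],
        "spatial_relation": [12.0, 19.0, 33.0, 41.0],
    }
    for pair, (r, independent) in pairwise_independence(toy_scores).items():
        print(pair, round(r, 3), "independent" if independent else "correlated")
```

In practice each list would hold one score per benchmarked model for a given ability, so a low correlation means that a model's rank on one ability says little about its rank on another.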

arXiv information

Authors	Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
Published	2025-02-17 14:50:53+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
