Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Summary

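Humans possess visual-spatial intelligence that lets them remember spaces from sequential visual observations.
But can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos?
We present VSI-Bench, a new video-based visual-spatial intelligence benchmark of more than 5,000 question-answer pairs, and find that MLLMs show competitive, though subhuman, visual-spatial intelligence.
We probe the models to express, both linguistically and visually, how they think in space. We find that spatial reasoning capability remains the primary bottleneck to higher benchmark performance, while local world models and spatial awareness do emerge within these models.
Notably, prevailing linguistic reasoning techniques (such as chain-of-thought, self-consistency, and tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question answering improves MLLMs' spatial distance ability.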

Summary (original)

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also “think in space” from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive – though subhuman – visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs’ spatial distance ability.

arXiv information

Authors	Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie
Published	2024-12-18 18:59:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
