Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

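Summary

Large vision-language models (VLMs) have long struggled with spatial reasoning tasks.
Surprisingly, even simple spatial reasoning tasks, such as recognizing an "under" or "behind" relationship between just two objects, pose significant challenges for current VLMs.
In this work, we study the spatial reasoning challenge through the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens.
By tracing the attention distribution over the image through the intermediate layers, we find that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, and that this alignment differs notably between familiar and unfamiliar spatial relationships.
Motivated by these findings, we propose AdaptVis, which uses inference-time confidence scores to sharpen attention on highly relevant regions when the model is confident, and to smooth and broaden the attention window to consider wider context when confidence is lower.
This training-free decoding method shows significant improvement (for example, up to 50 absolute points) on spatial reasoning benchmarks such as WhatsUp and VSR, at negligible cost.
Code and data are publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.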

Summary (original)

Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing ‘under’ or ‘behind’ relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model’s internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image throughout intermediate layers, we observe that successful spatial reasoning correlates strongly with the model’s ability to align its attention distribution with actual object locations, particularly differing between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS based on inference-time confidence scores to sharpen the attention on highly relevant regions when confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
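
The confidence-dependent sharpening and smoothing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name adapt_image_attention, the threshold beta, the coefficients alpha_sharpen and alpha_smooth, and the choice to rescale pre-softmax attention logits on image tokens are all assumptions made for illustration. A coefficient above 1 concentrates attention on the highest-scoring image tokens, while a coefficient below 1 spreads it over a wider area.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adapt_image_attention(logits, image_mask, confidence,
                          beta=0.5, alpha_sharpen=1.5, alpha_smooth=0.5):
    """Rescale attention logits on image tokens based on confidence.

    All parameter names and default values are hypothetical.
    confidence >= beta -> multiply image logits by alpha_sharpen (> 1)
    confidence <  beta -> multiply image logits by alpha_smooth  (< 1)
    Text-token logits are left unchanged.
    """
    alpha = alpha_sharpen if confidence >= beta else alpha_smooth
    scaled = np.where(image_mask, alpha * logits, logits)
    return softmax(scaled)

def entropy(p):
    """Shannon entropy of p after renormalizing it to sum to 1."""
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Toy example: four image tokens followed by two text tokens.
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0, 0.3])
image_mask = np.array([True, True, True, True, False, False])

confident = adapt_image_attention(logits, image_mask, confidence=0.9)
unsure = adapt_image_attention(logits, image_mask, confidence=0.2)

# Lower entropy over image tokens means more focused attention.
print("image-attention entropy (confident):", entropy(confident[image_mask]))
print("image-attention entropy (unsure):   ", entropy(unsure[image_mask]))
```

In this toy setup the confident case yields a lower entropy over the image tokens than the unsure case, which matches the intended behavior: focus when confident, look more broadly when not. How confidence is measured and which layers or heads are modified are left open here; the linked repository is the reference for the actual method.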

arXiv information

Authors	Shiqi Chen,Tongyao Zhu,Ruochen Zhou,Jinghan Zhang,Siyang Gao,Juan Carlos Niebles,Mor Geva,Junxian He,Jiajun Wu,Manling Li
Published	2025-03-04 18:01:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
