Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Abstract

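As text-based reasoning with large language models (LLMs) has advanced significantly, interest has grown in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs).
However, existing methods mostly approach multimodal reasoning in a text-centric way: both the reasoning and the derivation of the answer are carried out in text, and the only difference is the presence of multimodal input.
As a result, these methods often run into fundamental limitations on spatial reasoning tasks, which demand the precise geometric understanding and continuous spatial tracking that humans achieve through mental visualization and manipulation.
To address these limitations, we propose drawing to reason in space, a new paradigm that lets LVLMs reason through elementary drawing operations in the visual space.
By equipping models with basic drawing operations, such as annotating bounding boxes and drawing auxiliary lines, we enable them to express and analyze spatial relationships through direct visual manipulation.
To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to strengthen self-reflection behaviors, and reinforcement learning to directly optimize target rewards.
Extensive experiments show that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, including maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks.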

Abstract (original)

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address the limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, meanwhile avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
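
To make the two drawing operations named in the abstract concrete, here is a minimal sketch of annotating bounding boxes and drawing an auxiliary line on a small character grid. It is an illustration only, not the VILASR implementation: the grid canvas and the names `make_canvas`, `draw_box`, `draw_line`, and `render` are hypothetical, and the line routine uses Bresenham's algorithm as a stand-in for whatever rendering the authors actually use.

```python
# Illustrative sketch only; all names here are hypothetical and are not
# taken from the paper or its code.

def make_canvas(width, height, fill="."):
    """Return a height x width grid of single-character cells."""
    return [[fill] * width for _ in range(height)]


def draw_box(canvas, x0, y0, x1, y1, mark="#"):
    """Outline the axis-aligned box with inclusive corners (x0, y0), (x1, y1)."""
    for x in range(x0, x1 + 1):
        canvas[y0][x] = mark  # top edge
        canvas[y1][x] = mark  # bottom edge
    for y in range(y0, y1 + 1):
        canvas[y][x0] = mark  # left edge
        canvas[y][x1] = mark  # right edge


def draw_line(canvas, x0, y0, x1, y1, mark="*"):
    """Mark the cells on a straight line between two cells (Bresenham)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        canvas[y0][x0] = mark
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy


def render(canvas):
    """Join the grid into a printable multi-line string."""
    return "\n".join("".join(row) for row in canvas)


canvas = make_canvas(12, 6)
draw_box(canvas, 1, 1, 4, 4)    # annotate a first region
draw_box(canvas, 7, 1, 10, 4)   # annotate a second region
draw_line(canvas, 5, 2, 6, 2)   # auxiliary line across the gap between them
print(render(canvas))
```

In a real pipeline the canvas would be an image and the edited image would be fed back to the model for the next reasoning step; the grid here only shows the shape of the two operations.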

arXiv information

Authors	Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan
Published	2025-06-11 17:41:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
