EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks

Summary

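Multimodal large language models (MLLMs) have driven major advances in embodied intelligence, but they still struggle with spatial reasoning in complex long-horizon tasks.
To close this gap, we propose EmbodiedVSR (Embodied Visual Spatial Reasoning), a new framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning to improve the spatial understanding of embodied agents.
By explicitly building structured knowledge representations through dynamic scene graphs, the method enables zero-shot spatial reasoning without task-specific fine-tuning.
The approach not only disentangles complex spatial relationships but also aligns reasoning steps with actionable environmental dynamics.
To evaluate performance rigorously, we introduce the eSpatial-Benchmark, a comprehensive dataset of real-world embodied scenarios with fine-grained spatial annotations and adaptive task difficulty levels.
Experiments show that the framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence, particularly on long-horizon tasks that require iterative interaction with the environment.
The results reveal the untapped potential of MLLMs for embodied intelligence when they are equipped with structured, explainable reasoning mechanisms, paving the way for more reliable deployment in real-world spatial applications.
The code and datasets will be released soon.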

Summary (original)

While multimodal large language models (MLLMs) have made groundbreaking progress in embodied intelligence, they still face significant challenges in spatial reasoning for complex long-horizon tasks. To address this gap, we propose EmbodiedVSR (Embodied Visual Spatial Reasoning), a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning to enhance spatial understanding for embodied agents. By explicitly constructing structured knowledge representations through dynamic scene graphs, our method enables zero-shot spatial reasoning without task-specific fine-tuning. This approach not only disentangles intricate spatial relationships but also aligns reasoning steps with actionable environmental dynamics. To rigorously evaluate performance, we introduce the eSpatial-Benchmark, a comprehensive dataset including real-world embodied scenarios with fine-grained spatial annotations and adaptive task difficulty levels. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence, particularly in long-horizon tasks requiring iterative environment interaction. The results reveal the untapped potential of MLLMs for embodied intelligence when equipped with structured, explainable reasoning mechanisms, paving the way for more reliable deployment in real-world spatial applications. The codes and datasets will be released soon.
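
The abstract describes the idea only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes a scene graph can be approximated as a set of (subject, relation, object) triples that is updated after each action and then serialized into a text prompt for step-by-step reasoning. The names `SceneGraph`, `move`, `to_prompt`, and `build_cot_prompt` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): object names plus relation triples."""
    objects: set[str] = field(default_factory=set)
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        # Register both objects and record the (subject, relation, object) triple.
        self.objects.update((subject, obj))
        self.relations.add((subject, relation, obj))

    def move(self, subject: str, relation: str, target: str) -> None:
        # "Dynamic" update after an action: drop every triple in which the
        # moved object is the subject, then record its new placement.
        self.relations = {r for r in self.relations if r[0] != subject}
        self.add_relation(subject, relation, target)

    def to_prompt(self) -> str:
        # Serialize the triples in sorted order so the prompt is deterministic.
        lines = [f"- {s} is {r} {o}" for s, r, o in sorted(self.relations)]
        return "Current scene:\n" + "\n".join(lines)


def build_cot_prompt(graph: SceneGraph, question: str) -> str:
    # Assumed prompt layout: scene description, question, reasoning instruction.
    return (
        f"{graph.to_prompt()}\n"
        f"Question: {question}\n"
        "Reason step by step using only the relations listed above."
    )
```

In this sketch, calling `move("cup", "inside", "box")` after an action and then rebuilding the prompt keeps the text given to the model in step with the changed scene, which is one plausible reading of aligning reasoning steps with environmental dynamics.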

arXiv information

Authors	Yi Zhang, Qiang Zhang, Xiaozhu Ju, Zhaoyang Liu, Jilei Mao, Jingkai Sun, Jintao Wu, Shixiong Gao, Shihan Cai, Zhiyuan Qin, Linkai Liang, Jiaxu Wang, Yiqun Duan, Jiahang Cao, Renjing Xu, Jian Tang
Published	2025-03-14 05:06:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
