AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning

Summary

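We present AutoSpatial, an efficient method that uses structured spatial grounding to enhance the spatial reasoning of VLMs.
By combining minimal manual supervision with large-scale auto-labeling of Visual Question-Answering (VQA) pairs, the approach addresses the limited spatial understanding that VLMs show in social navigation tasks.
A hierarchical two-round VQA strategy applied during training gives AutoSpatial both a global and a detailed understanding of each scenario, yielding more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final actions, and explanations than other SOTA approaches.
These five components are essential for comprehensive social navigation reasoning.
The approach was evaluated both by expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet), which provided cross-validation scores, and by human evaluators, who assigned relative rankings to compare model performance across four key aspects.
With its enhanced spatial reasoning, AutoSpatial shows substantial improvements in the averaged cross-validation scores from the expert systems: perception and prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%), compared with baseline models trained only on manually annotated data.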
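
The abstract reports gains as percentages of an averaged score from three expert systems, but does not say how the averaging or the percentage is computed. The sketch below shows one plausible reading: average each model's per-aspect score across the evaluators, then take the relative improvement over the baseline. The function names, the evaluator keys, and all scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import mean


def averaged_score(scores_by_evaluator: dict[str, float]) -> float:
    """Average one model's score for one aspect across all evaluators."""
    return mean(list(scores_by_evaluator.values()))


def relative_improvement(candidate: float, baseline: float) -> float:
    """Percentage improvement of the candidate score over the baseline score."""
    return (candidate - baseline) / baseline * 100.0


# Made-up scores for a single aspect (for example, "reasoning").
# These are NOT results from the paper.
baseline = {"gpt-4o": 5.0, "gemini-2.0-flash": 6.0, "claude-3.5-sonnet": 7.0}
candidate = {"gpt-4o": 6.0, "gemini-2.0-flash": 7.0, "claude-3.5-sonnet": 8.0}

baseline_avg = averaged_score(baseline)
candidate_avg = averaged_score(candidate)
gain = relative_improvement(candidate_avg, baseline_avg)

print(f"baseline={baseline_avg:.2f} candidate={candidate_avg:.2f} gain={gain:.2f}%")
```

Averaging before computing the percentage keeps a single lenient or strict evaluator from dominating the result; whether the authors took this route or averaged per-evaluator percentages instead is not stated in the abstract.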

Summary (original)

We present a novel method, AutoSpatial, an efficient approach with structured spatial grounding to enhance VLMs’ spatial reasoning. By combining minimal manual supervision with large-scale Visual Question-Answering (VQA) pairs auto-labeling, our approach tackles the challenge of VLMs’ limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final action, and explanation compared to other SOTA approaches. These five components are essential for comprehensive social navigation reasoning. Our approach was evaluated using both expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet) that provided cross-validation scores and human evaluators who assigned relative rankings to compare model performances across four key aspects. Augmented by the enhanced spatial reasoning capabilities, AutoSpatial demonstrates substantial improvements by averaged cross-validation score from expert systems in: perception & prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%) compared to baseline models trained only on manually annotated data.

arXiv information

Authors	Yangzhe Kong, Daeun Song, Jing Liang, Dinesh Manocha, Ziyu Yao, Xuesu Xiao
Published	2025-03-10 17:27:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
