InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

Summary

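Vision-Language-Action models (VLAs) build on pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, and they hold great promise for general-purpose robotic systems.
Despite this progress, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, which limits their generalization beyond the training data.
To address this, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by strengthening the spatial reasoning ability of VLAs.
Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction.
It then aligns both the answer ("right/left/up/down/front/back/grasped") and the predicted actions with the ground truth.
Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, and it requires no extra training data or interaction with other large models.
Extensive experiments in both simulation and real-world environments demonstrate the effectiveness and flexibility of the approach.
Our code, pretrained models, and demos are publicly available at https://Koorye.github.io/proj/Inspire.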

Summary (original)

Leveraging pretrained Vision-Language Models (VLMs) to map language instruction and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA’s attention to task-relevant factors by prepending the question ‘In which direction is the [object] relative to the robot?’ to the language instruction and aligning the answer ‘right/left/up/down/front/back/grasped’ and predicted actions with the ground-truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach. Our code, pretrained models and demos are publicly available at: https://Koorye.github.io/proj/Inspire.
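
The prompt construction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the axis convention (+x right, +y front, +z up), the dominant-axis labeling rule, and the layout of the training target are all hypothetical. Only the question template and the seven answer words come from the abstract.

```python
# Answer vocabulary and question template as quoted in the abstract.
DIRECTIONS = ("right", "left", "up", "down", "front", "back", "grasped")
QUESTION_TEMPLATE = "In which direction is the {obj} relative to the robot?"


def build_prompt(obj: str, instruction: str) -> str:
    """Prepend the spatial question to the task instruction."""
    return QUESTION_TEMPLATE.format(obj=obj) + " " + instruction


def direction_label(rel_xyz, grasped: bool = False) -> str:
    """Hypothetical labeling rule: pick the dominant axis of the object's
    offset from the robot. Assumed convention: +x right, +y front, +z up."""
    if grasped:
        return "grasped"
    names = (("right", "left"), ("front", "back"), ("up", "down"))
    # Index of the axis with the largest absolute offset (first one wins ties).
    axis = max(range(3), key=lambda i: abs(rel_xyz[i]))
    positive, negative = names[axis]
    return positive if rel_xyz[axis] >= 0 else negative


def build_target(answer: str, action_tokens) -> str:
    """Assumed target layout for an autoregressive VLA: the direction answer
    first, then the action tokens, so both are supervised together."""
    if answer not in DIRECTIONS:
        raise ValueError(f"unknown direction: {answer}")
    return " ".join([answer, *action_tokens])


prompt = build_prompt("mug", "pick up the mug")
answer = direction_label((0.3, 0.1, -0.05))
target = build_target(answer, ["<a1>", "<a2>"])
```

Placing the answer before the action tokens is one plausible way to make an autoregressive model condition its actions on the spatial answer; the abstract itself only states that the answer and the predicted actions are both aligned with the ground truth.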

arXiv information

Authors	Ji Zhang, Shihan Wu, Xu Luo, Hao Wu, Lianli Gao, Heng Tao Shen, Jingkuan Song
Published	2025-05-20 03:48:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
