From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

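Abstract

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks.
Current Vision-Language-Action (VLA) models are built on top of general Vision-Language Models (VLMs), yet they still fall short of robust zero-shot performance because of the scarcity and heterogeneity prevalent in embodied datasets.
To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning and provides fine-grained guidance for robotic manipulation.
Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals.
Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed, more challenging benchmark VABench.
We also verify zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings.
Experimental results show that FSD achieves a 54.1% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.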

Abstract (original)

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD’s capabilities in both ‘seeing’ and ‘doing,’ achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 54.1% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

arXiv information

Authors	Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao
Published	2025-05-13 13:20:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
