ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow

Abstract

One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.

arXiv information

Authors	Changhe Chen, Quantao Yang, Xiaohao Xu, Nima Fazeli, Olov Andersson
Published	2025-05-12 13:37:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
