ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow

Abstract

One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.

arXiv information

Authors	Changhe Chen, Quantao Yang, Xiaohao Xu, Nima Fazeli, Olov Andersson
Published	2025-05-02 14:03:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL