Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

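Summary

Learning generalizable visual dynamic representations across different embodied environments is essential for real-world robotic manipulation.
Because the scale and diversity of robot demonstration data are limited, recent work has turned to large-scale pre-training on human data.
However, morphological differences between humans and robots create a significant human-robot domain discrepancy, which makes it hard for models pre-trained on human data to generalize to downstream manipulation tasks.
To address this, we propose a new adaptation paradigm that uses readily available paired human-robot video data to bridge the discrepancy.
Following this paradigm, our method uses a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner.
Experiments show significant improvements on 25 tasks across three different benchmarks, covering single-task and language-conditioned multi-task settings and evaluating two different pre-trained models.
On the large-scale RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks.
We will release the code and models upon acceptance.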
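
The abstract names a human-robot contrastive alignment loss but does not give its exact form. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over paired embeddings, which is one common way to pull matched pairs together and push mismatched pairs apart. Everything here is an assumption for illustration: the function names (`paired_contrastive_loss`, `cross_entropy_diagonal`), the temperature value, and the random toy embeddings are hypothetical and are not taken from the paper or its code.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def paired_contrastive_loss(human_emb, robot_emb, temperature=0.1):
    """Generic symmetric InfoNCE-style loss (assumed form, not the paper's).

    human_emb[i] and robot_emb[i] are treated as a matched pair; every other
    combination in the batch is treated as a negative.
    """
    h = [normalize(v) for v in human_emb]
    r = [normalize(v) for v in robot_emb]
    # Cosine similarity between every human and robot embedding, scaled.
    logits = [
        [sum(a * b for a, b in zip(hv, rv)) / temperature for rv in r]
        for hv in h
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the human-to-robot and robot-to-human directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy data: hypothetical 32-dimensional embeddings for 8 paired videos.
random.seed(0)
human = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]
aligned_robot = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in human]
shuffled_robot = aligned_robot[::-1]  # break the pairing on purpose

loss_aligned = paired_contrastive_loss(human, aligned_robot)
loss_shuffled = paired_contrastive_loss(human, shuffled_robot)
print(loss_aligned, loss_shuffled)
```

In this toy setup the loss is lower when each robot embedding sits next to its paired human embedding than when the pairing is broken. In a parameter-efficient adaptation setting, a loss of this kind would typically be minimized with respect to a small set of added parameters while the pre-trained backbone stays frozen; how the paper does this is not described in the abstract.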

Abstract (original)

Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, challenging the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. The experiments demonstrate significant improvements on 25 tasks across three different benchmarks, where the single-task, language-conditioned multi-task settings are covered, and two different pre-trained models are evaluated. On the large RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks. We will release the code and models upon acceptance.

arXiv information

Authors	Jiaming Zhou, Teli Ma, Kun-Yu Lin, Ronghe Qiu, Zifan Wang, Junwei Liang
Published	2024-06-20 11:57:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
