Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Summary

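Large-scale robotic systems typically rely on text instructions to specify tasks, but this work explores a different approach: can a robot infer the task directly from observing humans?
This shift requires the robot to decode human intent and translate it into actions that are executable within its physical constraints and environment.
We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots.
Given a video demonstration of a manipulation task and the current visual observations, Vid2Robot directly produces robot actions.
This is achieved with a unified representation model trained on a large dataset of human videos and robot trajectories.
The model uses cross-attention to fuse features of the prompt video with the robot's current state and to generate actions that mimic the observed task.
To further improve policy performance, we propose auxiliary contrastive losses that strengthen the alignment between human and robot video representations.
We evaluate Vid2Robot on real-world robots and demonstrate a 20% improvement in performance over other video-conditioned policies when human demonstration videos are used.
In addition, our model exhibits emergent capabilities, such as transferring observed motions from one object to another and long-horizon composition, which indicates its potential for real-world applications.
Project website: vid2robot.github.io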
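
The cross-attention fusion mentioned in the summary can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal single-head NumPy version in which tokens for the robot's current state act as queries over tokens from the prompt video. The function names, the token counts (4 state tokens, 10 prompt tokens), the feature size (16), and the random projection weights are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(state_tokens, prompt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, hypothetical helper).

    Queries come from the robot's current-state tokens; keys and values
    come from the prompt-video tokens, so each state token gathers the
    prompt-video features most relevant to it.
    """
    q = state_tokens @ w_q            # (n_state, d)
    k = prompt_tokens @ w_k           # (n_prompt, d)
    v = prompt_tokens @ w_v           # (n_prompt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


# Assumed toy sizes; real models use learned encoders and many more tokens.
rng = np.random.default_rng(0)
d_model = 16
state_tokens = rng.normal(size=(4, d_model))    # current observation tokens
prompt_tokens = rng.normal(size=(10, d_model))  # prompt-video tokens
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

fused, weights = cross_attention(state_tokens, prompt_tokens, w_q, w_k, w_v)
print(fused.shape)  # (4, 16)
```

In a full policy, the fused tokens would typically be passed to further layers that decode actions; that part is omitted here because the summary does not describe it.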

Abstract (original)

While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot’s ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human video and robot trajectory. The model leverages cross-attention mechanisms to fuse prompt video features to the robot’s current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations. We evaluate Vid2Robot on real-world robots, demonstrating a 20% improvement in performance compared to other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another, and long-horizon composition, thus showcasing its potential for real-world applications. Project website: vid2robot.github.io
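
The abstract mentions auxiliary contrastive losses that align human and robot video representations but does not give their form. As a hedged illustration only, the sketch below uses a symmetric InfoNCE loss, a common contrastive objective that is assumed here and is not confirmed as the one used in the paper. The function names, batch size, embedding size, and temperature are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def info_nce(human_emb, robot_emb, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of human_emb and row i of robot_emb are treated as a positive
    pair showing the same task; all other rows in the batch are negatives.
    """
    h = l2_normalize(human_emb)
    r = l2_normalize(robot_emb)
    logits = h @ r.T / temperature            # (batch, batch) similarities
    idx = np.arange(len(h))
    loss_h2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # human -> robot
    loss_r2h = -log_softmax(logits, axis=0)[idx, idx].mean()  # robot -> human
    return 0.5 * (loss_h2r + loss_r2h)


# Toy data: robot embeddings are noisy copies of the human embeddings.
rng = np.random.default_rng(0)
human = rng.normal(size=(8, 32))
robot_aligned = human + 0.05 * rng.normal(size=(8, 32))
robot_shuffled = robot_aligned[::-1]  # reversing breaks every pairing

loss_aligned = info_nce(human, robot_aligned)
loss_shuffled = info_nce(human, robot_shuffled)
print(loss_aligned < loss_shuffled)  # True
```

The loss is low when matching human and robot clips have similar embeddings and high when the pairing is broken, which is the behavior an alignment objective of this kind is meant to reward.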

arXiv information

Authors	Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
Published	2024-03-19 17:47:37+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
