Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

Summary

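We aim to learn a generalizable goal-conditioned policy for zero-shot robot manipulation, that is, interacting with unseen objects in novel scenes without test-time adaptation.
Typical approaches rely on large amounts of demonstration data for this kind of generalization. We instead propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world.
Our framework, Track2Act, predicts tracks of how points in an image should move over future time steps given a goal, and it can be trained on diverse web videos, including videos of humans and robots manipulating everyday objects.
From these 2D track predictions we infer a sequence of rigid transforms of the object to be manipulated and obtain robot end-effector poses that can be executed open loop.
We then refine this open-loop plan by predicting residual actions with a closed-loop policy trained on a few embodiment-specific demonstrations.
We show that combining scalably learned track prediction with a residual policy that needs only minimal in-domain, robot-specific data enables diverse, generalizable robot manipulation, and we present a wide range of real-world manipulation results across unseen tasks, objects, and scenes.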
https://homangab.github.io/track2act/

Abstract (original)

We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/
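
The step from point tracks to rigid transforms, followed by residual refinement, can be illustrated with a simplified planar analogue. This is a minimal sketch and not the authors' implementation: the abstract describes inferring rigid transforms of the manipulated object and end-effector poses, whereas the example below only fits an in-plane rotation and translation to 2D point tracks by least squares. The function names `fit_rigid_2d`, `transforms_from_tracks`, and `apply_residual` are hypothetical, and treating the residual as a simple additive correction on pose parameters is an assumption made for illustration.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares 2D rotation + translation mapping src points onto dst.

    src, dst: equal-length lists of (x, y) tuples (corresponding points).
    Returns (theta, tx, ty) such that dst ~= R(theta) @ src + (tx, ty).
    Simplified planar stand-in for the rigid-transform inference step.
    """
    n = len(src)
    if n == 0 or n != len(dst):
        raise ValueError("src and dst must be non-empty and the same length")

    # Centroids of both point sets.
    cx_s = sum(x for x, _ in src) / n
    cy_s = sum(y for _, y in src) / n
    cx_d = sum(x for x, _ in dst) / n
    cy_d = sum(y for _, y in dst) / n

    # Accumulate dot and cross products of the centered correspondences;
    # the optimal rotation angle is atan2(sum of crosses, sum of dots).
    dot = 0.0
    cross = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s
        bx, by = xd - cx_d, yd - cy_d
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)

    # Translation aligns the rotated source centroid with the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, tx, ty


def transforms_from_tracks(tracks):
    """tracks[t] is the list of (x, y) point positions at time step t.

    Returns one (theta, tx, ty) per future step, each relative to step 0.
    """
    return [fit_rigid_2d(tracks[0], frame) for frame in tracks[1:]]


def apply_residual(open_loop_step, residual):
    """Add a predicted correction to one open-loop step (assumed additive)."""
    return tuple(a + r for a, r in zip(open_loop_step, residual))
```

For example, passing a unit square and the same square rotated by 90 degrees and shifted by (1, 2) to `fit_rigid_2d` recovers an angle close to `math.pi / 2` and a translation close to (1, 2). In a closed-loop setting, the residual passed to `apply_residual` would come from a learned policy conditioned on the current observation; here it is just a tuple of numbers.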

arXiv information

Authors	Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, Shubham Tulsiani
Published	2024-08-08 23:18:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
