Prediction with Action: Visual Policy Learning via Joint Denoising Process

Summary

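Diffusion models have demonstrated remarkable capabilities in image generation tasks such as image editing and video creation, which reflects a good understanding of the physical world.
Diffusion models have also shown promise in robotic control tasks, where they generate actions by denoising; this approach is known as diffusion policy.
Although diffusion generative models and diffusion policies serve different functions (image prediction and robot action, respectively), they technically follow a similar denoising process.
In robotic tasks, the ability to predict future images and the ability to generate actions are highly correlated, because both depend on the same underlying dynamics of the physical world.
Building on this insight, the paper introduces PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process.
Specifically, PAD uses Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling simultaneous prediction of future images and robot actions.
PAD also supports co-training on both robot demonstrations and large-scale video datasets, and can easily be extended to other robotic modalities such as depth images.
Using a single text-conditioned visual policy in a data-efficient imitation learning setting, PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark.
Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings, with a 28.0% increase in success rate compared to the strongest baseline.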
Project page: https://sites.google.com/view/pad-paper
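
The joint denoising idea in the summary can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the DiT, the conditioning vector `cond` stands for whatever embedding of the current observation and text instruction a real system would use, and the DDPM-style ancestral sampling loop with a linear noise schedule is an assumed choice, since the abstract does not specify a sampler. The point it shows is that future-image tokens and action tokens sit in one sequence and are denoised together at every step.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (assumed); needs num_steps >= 2."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(tokens, t, cond):
    """Hypothetical stand-in for the DiT: one noise estimate per token.

    A real model would attend across all tokens. Here the per-dimension
    mean over the whole sequence plays that role, so image and action
    tokens influence each other. The timestep t is ignored in this toy.
    """
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * tok[d] + 0.1 * mean[d] - 0.1 * cond[d] for d in range(dim)]
            for tok in tokens]


def joint_denoise(cond, n_img, n_act, dim, num_steps=50, seed=0):
    """Denoise future-image tokens and action tokens in a single loop."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    # One sequence holds both modalities: image tokens first, then actions.
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
         for _ in range(n_img + n_act)]
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, t, cond)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise at the last step
        x = [[inv_sqrt_alpha * (x[i][d] - scale * eps_hat[i][d])
              + sigma * rng.gauss(0.0, 1.0)
              for d in range(dim)]
             for i in range(len(x))]
    # Split the shared sequence back into the two predictions.
    return x[:n_img], x[n_img:]


future_tokens, action_tokens = joint_denoise([0.0] * 8, n_img=16, n_act=3, dim=8)
```

In this sketch the two outputs come from the same reverse process, so any coupling learned by the denoiser between predicted images and actions is applied at every step rather than in separate models.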

Summary (original)

Diffusion models have demonstrated remarkable capabilities in image generation tasks, including image editing and video creation, representing a good understanding of the physical world. On the other line, diffusion models have also shown promise in robotic control tasks by denoising actions, known as diffusion policy. Although the diffusion generative model and diffusion policy exhibit distinct capabilities–image prediction and robotic action, respectively–they technically follow a similar denoising process. In robotic tasks, the ability to predict future images and generate actions is highly correlated since they share the same underlying dynamics of the physical world. Building on this insight, we introduce PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on both robotic demonstrations and large-scale video datasets and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark, by utilizing a single text-conditioned visual policy within a data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings with 28.0% success rate increase compared to the strongest baseline. Project page at https://sites.google.com/view/pad-paper

arXiv information

Authors	Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen
Published	2024-11-27 09:54:58+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
