Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

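Summary

Visual representations play a crucial role in the development of generalist robot policies.
Earlier vision encoders are typically pre-trained with single-image reconstruction or two-image contrastive learning, so they tend to capture static information and often neglect the dynamic aspects that are essential for embodied tasks.
Recently, video diffusion models (VDMs) have shown the ability to predict future frames and a strong understanding of the physical world.
The authors hypothesize that VDMs inherently produce visual representations containing both current static information and predicted future dynamics, and that these representations therefore provide valuable guidance for robot action learning.
Based on this hypothesis, they propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on the predicted future representations inside VDMs.
To predict the future more precisely, they fine-tune a pre-trained video foundation model on robot datasets together with internet human manipulation data.
In experiments, VPP achieves an 18.6% relative improvement over the previous state of the art on the Calvin ABC-D generalization benchmark, and a 31.6% increase in success rate on complex real-world dexterous manipulation tasks.
Project page: https://video-prediction-policy.github.io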

Abstract (original)

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) demonstrate the ability to predict future frames and showcase a strong understanding of physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict more precise future, we fine-tune pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves a 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page at https://video-prediction-policy.github.io
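
The idea of an inverse dynamics model conditioned on predicted future representations can be illustrated with a toy sketch. This is not the VPP implementation: the feature extrapolation below is a hypothetical stand-in for a video diffusion model, the single linear layer is a hypothetical stand-in for the policy head, and every function name and dimension is an assumption made for illustration only.

```python
# Toy sketch only: all names, shapes, and the linear "model" are assumptions,
# not the architecture described in the paper.

def predict_future_features(current, horizon=2, step=0.1):
    """Hypothetical stand-in for predictive features from a video model.

    Returns `horizon` future feature vectors by shifting the current
    features by a fixed step, which keeps the example deterministic.
    """
    return [[x + step * (k + 1) for x in current] for k in range(horizon)]


def inverse_dynamics_action(current, future, weights):
    """Map current plus predicted-future features to an action.

    The features are concatenated and passed through one linear layer
    (one weight row per action dimension).
    """
    inputs = list(current) + [x for frame in future for x in frame]
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Example usage with made-up numbers.
current = [0.0, 1.0, 2.0]                       # current-frame features
future = predict_future_features(current, 2)    # two predicted future steps
input_dim = len(current) * (1 + len(future))    # 3 features x 3 time steps
weights = [[1.0] * input_dim for _ in range(2)] # 2-D action, all-ones weights
action = inverse_dynamics_action(current, future, weights)
print(action)
```

The point of the sketch is only the data flow: the action is computed from the current observation together with a representation of where the scene is predicted to go, rather than from the current observation alone.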

arXiv information

Authors	Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
Published	2025-05-04 04:28:53+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
