DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Summary

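Predicting future outcomes from control actions is fundamental to physical reasoning. Yet such predictive models, often called world models, remain difficult to learn and are typically built for task-specific solutions with online policy learning. To unlock the true potential of world models, the authors argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, they propose the DINO World Model (DINO-WM), a new method that models visual dynamics without reconstructing the visual world. DINO-WM uses spatial patch features pre-trained with DINOv2 and learns from offline behavioral trajectories by predicting future patch features. This lets DINO-WM reach observational goals by optimizing action sequences, and treating goal features as prediction targets makes the planning task-agnostic. The authors demonstrate that DINO-WM produces zero-shot behavioral solutions at test time in six environments without expert demonstrations, reward modeling, or pre-learned inverse models, and that it outperforms prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.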

Summary (original)

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models’ true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
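The planning idea in the abstract (optimize an action sequence so that the predicted future features match the goal features) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not name the optimizer, so the cross-entropy method (CEM) is used as one common choice; `toy_world_model` is a hypothetical stand-in with dynamics z' = z + a, not the learned DINO-WM predictor; and the two-dimensional "features" stand in for DINOv2 patch features.

```python
import random


def toy_world_model(z, a):
    """Hypothetical stand-in for a learned latent dynamics model: z' = z + a."""
    return [zi + ai for zi, ai in zip(z, a)]


def rollout(z0, actions, model):
    """Apply the model step by step and return the final predicted features."""
    z = z0
    for a in actions:
        z = model(z, a)
    return z


def feature_cost(z, z_goal):
    """Squared distance between predicted features and goal features."""
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))


def plan_cem(z0, z_goal, model, horizon=5, dim=2,
             samples=200, elites=20, iters=30, seed=0):
    """Cross-entropy method over action sequences (assumed optimizer)."""
    rng = random.Random(seed)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        scored = []
        for _ in range(samples):
            # Sample one candidate action sequence from the current Gaussian.
            seq = [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
                   for t in range(horizon)]
            scored.append((feature_cost(rollout(z0, seq, model), z_goal), seq))
        # Keep the lowest-cost sequences and refit the Gaussian to them.
        scored.sort(key=lambda pair: pair[0])
        top = [seq for _, seq in scored[:elites]]
        for t in range(horizon):
            for d in range(dim):
                vals = [seq[t][d] for seq in top]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = max(var ** 0.5, 1e-3)
    return mean  # the planned action sequence


if __name__ == "__main__":
    start, goal = [0.0, 0.0], [1.0, -2.0]
    plan = plan_cem(start, goal, toy_world_model)
    print("final cost:", feature_cost(rollout(start, plan, toy_world_model), goal))
```

No reward function or demonstration appears anywhere in the sketch: the only task signal is the goal feature vector, which is what makes this style of planning task-agnostic. In a real system the stand-in model would be replaced by the trained feature predictor.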

arXiv information

Authors	Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto
Published	2025-02-01 02:40:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
