PWM: Policy Learning with Multi-Task World Models

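Summary

Reinforcement Learning (RL) has made significant progress on complex tasks, but it struggles in multi-task settings that span different embodiments.
World model methods offer scalability by learning a simulation of the environment, but they often rely on inefficient gradient-free optimization for policy extraction.
In contrast, gradient-based methods exhibit lower variance but cannot handle discontinuities.
Our work shows that well-regularized world models can produce smoother optimization landscapes than the actual dynamics, which enables more effective first-order optimization.
We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control.
The world model is first pre-trained on offline data, and policies are then extracted from it using first-order optimization in less than 10 minutes per task.
PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics.
In addition, PWM scales to an 80-task setting, achieving up to 27% higher reward than existing baselines without relying on costly online planning.
Visualizations and code are available at https://www.imgeorgiev.com/pwm/.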

Abstract (original)

Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without relying on costly online planning. Visualizations and code are available at https://www.imgeorgiev.com/pwm/.
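The claim that a smooth model can make first-order optimization workable where the true dynamics cannot is easy to see on a toy problem. The sketch below is not PWM and is not taken from the authors' code. It assumes a one-dimensional action, a hypothetical step-shaped "ground-truth" reward (`true_reward`), and a hypothetical sigmoid surrogate (`model_reward`) standing in for a regularized world model. All names, the threshold of 0.5, the temperature, and the learning rate are illustrative assumptions.

```python
import math


def true_reward(action: float) -> float:
    """Toy discontinuous reward (assumption): success only past a threshold."""
    return 1.0 if action > 0.5 else 0.0


def model_reward(action: float, temperature: float = 0.2) -> float:
    """Hypothetical smooth surrogate standing in for a regularized world model."""
    return 1.0 / (1.0 + math.exp(-(action - 0.5) / temperature))


def model_reward_grad(action: float, temperature: float = 0.2) -> float:
    """Analytic derivative of the sigmoid surrogate with respect to the action."""
    r = model_reward(action, temperature)
    return r * (1.0 - r) / temperature


def finite_difference(f, x: float, eps: float = 1e-4) -> float:
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def extract_action(initial_action: float = 0.0, lr: float = 0.5, steps: int = 50) -> float:
    """Plain gradient ascent on the surrogate reward (first-order optimization)."""
    action = initial_action
    for _ in range(steps):
        action += lr * model_reward_grad(action)
    return action


# The true reward is flat around the starting point, so its gradient carries no signal.
true_grad_at_start = finite_difference(true_reward, 0.0)

# The surrogate has a nonzero gradient everywhere, so ascent crosses the threshold.
learned_action = extract_action()

print("gradient of true reward at 0.0:", true_grad_at_start)
print("action found via surrogate:", learned_action)
print("true reward of that action:", true_reward(learned_action))
```

In this toy setting the gradient of the step reward is zero at the starting point, while gradient ascent on the sigmoid surrogate moves the action past the threshold and earns the full true reward. A real world model is learned from data and operates over multi-step rollouts, which this sketch leaves out.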

arXiv information

Authors	Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
Published	2025-02-24 06:56:00+00:00
arXiv page	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
