One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion

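Summary

Conventional reinforcement learning (RL) methods require task-specific rewards and struggle to make use of datasets that grow as more training terrains are added. We propose DreamPolicy, a unified framework in which a single policy masters diverse terrains and generalizes zero-shot to unseen scenarios by systematically integrating offline data with diffusion-driven motion synthesis. At its core, DreamPolicy introduces Humanoid Motion Imagery (HMI): future state predictions synthesized by an autoregressive, terrain-aware diffusion planner whose data is curated by aggregating rollouts from policies specialized for a variety of distinct terrains. Unlike human motion datasets, which require laborious retargeting, our data directly captures humanoid kinematics, allowing the diffusion planner to synthesize "dreamed" trajectories that encode terrain-specific physical constraints. These trajectories serve as dynamic objectives for the HMI-conditioned policy, avoiding manual reward engineering and enabling cross-terrain generalization. DreamPolicy addresses the scalability limits of earlier methods: whereas conventional RL cannot exploit growing datasets, our framework scales seamlessly as offline data increases. As the dataset expands, the diffusion prior learns richer locomotion skills, and the policy uses them to master new terrains without retraining. In experiments, DreamPolicy achieves an average success rate of 90% in training environments and, on unseen terrains, an average success rate 20% higher than the prevalent method. It also generalizes to perturbed and composite scenarios in which prior approaches break down. By unifying offline data, diffusion-based trajectory synthesis, and policy optimization, DreamPolicy overcomes the "one task, one policy" bottleneck and establishes a paradigm for scalable, data-driven humanoid control.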

Summary (original)

Humanoid locomotion faces a critical scalability challenge: traditional reinforcement learning (RL) methods require task-specific rewards and struggle to leverage growing datasets, even as more training terrains are introduced. We propose DreamPolicy, a unified framework that enables a single policy to master diverse terrains and generalize zero-shot to unseen scenarios by systematically integrating offline data and diffusion-driven motion synthesis. At its core, DreamPolicy introduces Humanoid Motion Imagery (HMI) – future state predictions synthesized through an autoregressive terrain-aware diffusion planner curated by aggregating rollouts from specialized policies across various distinct terrains. Unlike human motion datasets requiring laborious retargeting, our data directly captures humanoid kinematics, enabling the diffusion planner to synthesize ‘dreamed’ trajectories that encode terrain-specific physical constraints. These trajectories act as dynamic objectives for our HMI-conditioned policy, bypassing manual reward engineering and enabling cross-terrain generalization. DreamPolicy addresses the scalability limitations of prior methods: while traditional RL fails to exploit growing datasets, our framework scales seamlessly with more offline data. As the dataset expands, the diffusion prior learns richer locomotion skills, which the policy leverages to master new terrains without retraining. Experiments demonstrate that DreamPolicy achieves average 90% success rates in training environments and an average of 20% higher success on unseen terrains than the prevalent method. It also generalizes to perturbed and composite scenarios where prior approaches collapse. By unifying offline data, diffusion-based trajectory synthesis, and policy optimization, DreamPolicy overcomes the ‘one task, one policy’ bottleneck, establishing a paradigm for scalable, data-driven humanoid control.
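The control loop described in the abstract (a planner imagines future states for the current terrain, and a policy conditioned on those states chooses the action) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the one-dimensional state, the names `plan_future_states`, `hmi_conditioned_policy`, `step_env`, and `rollout`, the constants, and the hand-written rules. In particular, the planner here is a simple deterministic placeholder, not a diffusion model, and the policy is a proportional tracker, not a learned network. The abstract does not specify how the actual system is wired.

```python
from typing import List

State = List[float]  # toy state: [position, velocity]

DT = 0.1  # integration step in seconds (arbitrary choice)
KP = 5.0  # tracking gain of the placeholder policy (arbitrary choice)


def plan_future_states(state: State, terrain_height: float, horizon: int) -> List[State]:
    """Hypothetical stand-in for the terrain-aware planner.

    Returns `horizon` imagined future states, each built from the previous
    one. A real system would sample these from a trained diffusion model;
    here a made-up rule simply lowers the target speed on rougher terrain.
    """
    pos, vel = state
    target_vel = 1.0 / (1.0 + abs(terrain_height))
    future: List[State] = []
    for _ in range(horizon):
        vel += 0.5 * (target_vel - vel)  # move halfway toward the target speed
        pos += vel * DT
        future.append([pos, vel])
    return future


def hmi_conditioned_policy(state: State, hmi: List[State]) -> float:
    """Placeholder policy: accelerate toward the first imagined velocity."""
    return KP * (hmi[0][1] - state[1])


def step_env(state: State, action: float) -> State:
    """Toy dynamics: the action is an acceleration applied for one step."""
    pos, vel = state
    vel = vel + action * DT
    pos = pos + vel * DT
    return [pos, vel]


def rollout(terrain_heights: List[float], horizon: int = 4) -> List[State]:
    """Re-plan from the latest state at every step, then act on the plan."""
    state: State = [0.0, 0.0]
    trajectory = [state]
    for height in terrain_heights:
        hmi = plan_future_states(state, height, horizon)
        action = hmi_conditioned_policy(state, hmi)
        state = step_env(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    for pos, vel in rollout([0.0, 0.0, 0.5, 0.5, 2.0]):
        print(f"pos={pos:.3f} vel={vel:.3f}")
```

The point of the sketch is the division of labor: the policy receives no terrain-specific reward or terrain input of its own, only the imagined states, so changing the terrain changes behavior solely through the planner's output.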

arXiv information

Authors	Yahao Fan, Tianxiang Gui, Kaiyang Ji, Shutong Ding, Chixuan Zhang, Jiayuan Gu, Jingyi Yu, Jingya Wang, Ye Shi
Published	2025-06-03 03:10:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
