Text-Aware Diffusion for Policy Learning

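Summary

Training an agent to achieve particular goals or perform desired behaviors is often done with reinforcement learning, especially when no expert demonstrations are available.
However, supporting new goals or behaviors through reinforcement learning requires ad-hoc design of an appropriate reward function for each one, which quickly becomes intractable.
To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning.
We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in line with a notion of naturalness distilled from internet-scale training data.
In our experiments, we demonstrate that TADPoLe can learn policies for novel goal-achievement and continuous locomotion behaviors specified in natural language, in both Humanoid and Dog environments.
The behaviors are learned zero-shot, without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation.
We further show that TADPoLe performs competitively on robotic manipulation tasks in the Meta-World environment, without access to any in-domain demonstrations.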
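The summary says only that a frozen text-conditioned diffusion model supplies a dense reward; it does not give the formula. The following is a minimal sketch of one plausible way such a reward could be computed per rendered frame, under assumptions that are not taken from the text: `toy_denoiser` is a hypothetical stand-in for the pretrained model, `text_embedding` is a placeholder array, and the reward is assumed to be "how much better the model predicts the injected noise when it is given the text".

```python
import numpy as np


def toy_denoiser(noisy_frame, text_embedding):
    """Hypothetical stand-in for a frozen text-conditioned diffusion model.

    Returns a noise prediction with the same shape as the frame. A real
    system would call a pretrained network here; this toy only mimics the
    interface (text_embedding=None means an unconditional prediction).
    """
    if text_embedding is None:
        return 0.1 * noisy_frame
    return 0.1 * noisy_frame + 0.5 * (noisy_frame - text_embedding)


def text_alignment_reward(frame, text_embedding, noise_level, rng,
                          denoiser=toy_denoiser):
    """Assumed reward: gain in noise-prediction accuracy from conditioning on text."""
    noise = rng.standard_normal(frame.shape)
    # Corrupt the rendered frame with Gaussian noise (noise_level in (0, 1)).
    noisy = np.sqrt(1.0 - noise_level) * frame + np.sqrt(noise_level) * noise
    eps_cond = denoiser(noisy, text_embedding)
    eps_uncond = denoiser(noisy, None)
    err_cond = np.mean((eps_cond - noise) ** 2)
    err_uncond = np.mean((eps_uncond - noise) ** 2)
    # Positive when the text makes the prediction more accurate.
    return float(err_uncond - err_cond)


def dense_rewards(frames, text_embedding, noise_level, rng):
    """One reward per frame, so every timestep of a rollout gets a signal."""
    return [text_alignment_reward(f, text_embedding, noise_level, rng)
            for f in frames]
```

Because the denoiser is never updated, the only learned component in this sketch would be the policy that produces the frames; the per-frame rewards could then be handed to any standard reinforcement learning algorithm.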

Summary (original)

Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations. However, supporting novel goals or behaviors through reinforcement learning requires the ad-hoc design of appropriate reward functions, which quickly becomes intractable. To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in alignment with a notion of naturalness summarized from internet-scale training data. In our experiments, we demonstrate that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments. The behaviors are learned zero-shot without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation. We further show that TADPoLe performs competitively when applied to robotic manipulation tasks in the Meta-World environment, without having access to any in-domain demonstrations.

arXiv information

Authors	Calvin Luo, Mandy He, Zilai Zeng, Chen Sun
Published	2024-10-31 16:49:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
