Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Summary

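Model-based reinforcement learning (RL) has achieved remarkable success on a range of continuous control tasks thanks to its high sample efficiency.
To save the computational cost of planning online, recent work tends to distill optimized action sequences into an RL policy during the training phase.
Although distillation can combine the foresight of planning with the exploration ability of RL policies, the theoretical understanding of these methods remains unclear.
In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach that distills model-based planning into the policy.
We then show that this policy improvement approach is theoretically guaranteed to improve monotonically and to converge to the maximum value defined in SAC.
We discuss effective design choices and implement the theory as a practical algorithm, Model-based Planning Distilled to Policy (MPDP), which updates the policy jointly over multiple future time steps.
Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.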

Summary (original)

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm — Model-based Planning Distilled to Policy (MPDP) — that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.

arXiv information

Authors	Chuming Li, Ruonan Jia, Jie Liu, Yinmin Zhang, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang
Published	2023-07-24 16:52:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
