Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

Abstract

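In constrained reinforcement learning (CRL), an agent explores the environment to learn an optimal policy while satisfying constraints.
The penalty function method, recently studied as an effective way to handle constraints, adds constraint penalties to the objective, turning the constrained problem into an unconstrained one.
However, it is difficult to choose penalties that balance policy performance against efficient constraint satisfaction.
In this paper, we propose Exterior Penalty Policy Optimization (EPO), a theoretically guaranteed penalty function method whose adaptive penalties are generated by a Penalty Metric Network (PMN).
The PMN responds appropriately to varying degrees of constraint violation, enabling efficient constraint satisfaction and safe exploration.
We prove theoretically that EPO consistently improves constraint satisfaction, with a convergence guarantee.
We also propose a new surrogate function and provide the worst-case constraint violation and approximation error.
For practical use, we propose an effective smooth penalty function that is easy to implement with a first-order optimizer.
Extensive experiments show that EPO outperforms the baselines in policy performance and constraint satisfaction while training stably, particularly on complex tasks.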
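To make the penalty idea concrete, here is a minimal sketch of a classical exterior penalty on a one-dimensional toy problem, solved with plain gradient descent as the first-order optimizer. It is not the EPO algorithm or the PMN from the paper: the loss `(x - 2)^2`, the constraint `x - 1 <= 0`, the quadratic penalty `max(0, x - 1)^2`, the fixed coefficients `mu`, and every function name are assumptions chosen for illustration.

```python
def objective_grad(x):
    # Gradient of the toy loss (x - 2)^2, whose unconstrained minimum is x = 2.
    return 2.0 * (x - 2.0)


def violation(x):
    # Toy constraint g(x) = x - 1 <= 0; only the positive part is a violation.
    return max(0.0, x - 1.0)


def penalized_grad(x, mu):
    # Gradient of the penalized loss (x - 2)^2 + mu * max(0, x - 1)^2.
    # The squared penalty is differentiable everywhere, so a first-order
    # optimizer can be applied to it directly.
    return objective_grad(x) + 2.0 * mu * violation(x)


def solve(mu, x0=0.0, steps=2000):
    # Plain gradient descent with a step size below the stability limit
    # 2 / (2 + 2 * mu) of the penalized loss.
    lr = 1.0 / (2.0 + 2.0 * mu)
    x = x0
    for _ in range(steps):
        x -= lr * penalized_grad(x, mu)
    return x


# Hand-picked, fixed penalty coefficients (an assumption of this sketch;
# the paper instead generates adaptive penalties with a network).
results = {mu: solve(mu) for mu in (1.0, 10.0, 100.0)}
for mu, x in results.items():
    print(f"mu={mu:6.1f}  x={x:.4f}  violation={violation(x):.4f}")
```

For this toy problem the penalized minimizer is `(2 + mu) / (1 + mu)`, which approaches the constrained optimum `x = 1` from the infeasible side as `mu` grows. A small `mu` leaves a large violation, while a large `mu` forces a small step size, which is one way to see why choosing the penalty is a balancing act.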

Abstract (original)

In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints, which imposes constraints penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose appropriate penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments are conducted, showing that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
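As a rough illustration of the transformation the abstract describes, a constrained policy optimization problem and a generic exterior-penalty counterpart can be written as follows. The notation is an assumption made for this sketch and is not taken from the paper: $J_R(\pi)$ is the expected return, $J_C(\pi)$ the expected cost, $d$ the cost limit, and $\mu > 0$ a fixed penalty coefficient. The paper's own surrogate function, smooth penalty, and PMN-generated penalties may differ from this textbook form.

```latex
\begin{align}
  &\max_{\pi}\; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d, \\
  &\max_{\pi}\; J_R(\pi) - \mu \,\max\bigl(0,\; J_C(\pi) - d\bigr).
\end{align}
```

The penalty term is zero for every feasible policy and grows with the size of the violation, so it acts only outside the feasible region, which is what makes the method "exterior".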

arXiv information

Authors	Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Published	2024-07-22 10:57:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
