Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning

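Summary

A popular framework for enforcing safe actions in reinforcement learning (RL) is Constrained RL, in which trajectory-based constraints on expected cost (or other cost measures) enforce safety. Importantly, these constraints are enforced while maximizing expected reward.
Most recent approaches to Constrained RL convert the trajectory-based cost constraint into a surrogate problem that can be solved with minor modifications to standard RL methods.
A key drawback of such approaches is that the cost constraint is over- or underestimated at each state.
We therefore propose an approach that leaves the trajectory-based cost constraint unmodified and instead imitates "good" trajectories and avoids "bad" trajectories generated by incrementally improving policies.
We employ an oracle that uses a reward threshold (varied during learning) and the overall cost constraint to label trajectories as "good" or "bad".
A key advantage of our approach is that it can start from any initial policy or set of trajectories and improve on it.
In an exhaustive set of experiments, we demonstrate that our approach can outperform top benchmark approaches for solving Constrained RL problems with respect to expected cost, CVaR cost, or even unknown cost constraints.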

Abstract (original)

A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory based constraints on expected cost (or other cost measures) are employed to enforce safety and more importantly these constraints are enforced while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback with such approaches is an over or underestimation of the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory based cost constraint and instead imitates “good” trajectories and avoids “bad” trajectories generated from incrementally improving policies. We employ an oracle that utilizes a reward threshold (which is varied with learning) and the overall cost constraint to label trajectories as “good” or “bad”. A key advantage of our approach is that we are able to work from any starting policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach is able to outperform top benchmark approaches for solving Constrained RL problems, with respect to expected cost, CVaR cost, or even unknown cost constraints.
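
The labeling oracle mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that the oracle uses a reward threshold and the overall cost constraint; the exact rule below (a trajectory is "good" when its total cost is within the limit and its total reward meets the threshold) is an assumption made for illustration, as are the names `Trajectory`, `label_trajectories`, `reward_threshold`, and `cost_limit`. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container for per-step rewards and costs of one rollout."""
    rewards: List[float]
    costs: List[float]

    @property
    def total_reward(self) -> float:
        return sum(self.rewards)

    @property
    def total_cost(self) -> float:
        return sum(self.costs)


def label_trajectories(
    trajectories: List[Trajectory],
    reward_threshold: float,
    cost_limit: float,
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split trajectories into (good, bad).

    Assumed rule: "good" means the total cost stays within cost_limit
    and the total reward reaches reward_threshold; everything else is "bad".
    """
    good: List[Trajectory] = []
    bad: List[Trajectory] = []
    for traj in trajectories:
        within_cost = traj.total_cost <= cost_limit
        enough_reward = traj.total_reward >= reward_threshold
        if within_cost and enough_reward:
            good.append(traj)
        else:
            bad.append(traj)
    return good, bad
```

In a training loop of the kind the abstract describes, `reward_threshold` would be adjusted as learning proceeds, with the policy imitating the good set and steering away from the bad set; how the threshold is varied is not specified in the abstract and is left out of this sketch.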

arXiv information

Authors	Huy Hoang, Tien Mai, Pradeep Varakantham
Published	2023-12-26 07:55:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
