Regret-Based Optimization for Robust Reinforcement Learning

Summary

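Deep reinforcement learning (DRL) policies have been shown to be vulnerable to small adversarial noise in their observations.
Such noise can have disastrous consequences in safety-critical environments.
For example, a self-driving car that receives adversarially perturbed sensory observations of nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., a car altered to be recognized as a tree) could cause a fatal accident.
Existing approaches to making RL algorithms robust against an observation-perturbing adversary have focused on reactive methods that iteratively improve against adversarial examples generated at each iteration.
Although such approaches have been shown to improve on regular RL methods, they are reactive and can fare significantly worse if certain categories of adversarial examples are not generated during training.
To that end, we pursue a more proactive approach that directly optimizes regret, a well-studied robustness measure, instead of expected value.
We provide a principled approach that minimizes the maximum regret over a "neighborhood" of observations around the received observation.
Our regret criterion can be used to modify existing value-based and policy-based deep RL methods.
We demonstrate that our approaches significantly improve performance over leading approaches to robust deep RL across a wide variety of benchmarks.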

Abstract (original)

Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such adversarial noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car receiving adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can be fatal. Existing approaches for making RL algorithms robust to an observation-perturbing adversary have focused on reactive approaches that iteratively improve against adversarial examples generated at each iteration. While such approaches have been shown to provide improvements over regular RL methods, they are reactive and can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that relies on directly optimizing a well-studied robustness measure, regret instead of expected value. We provide a principled approach that minimizes maximum regret over a ‘neighborhood’ of observations to the received ‘observation’. Our regret criterion can be used to modify existing value- and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.
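
The idea of minimizing maximum regret over a neighborhood of observations can be illustrated with a small sketch. This is not the authors' algorithm, only a generic minimax-regret action rule under stated assumptions: a value estimate q(state, action) is already available, the neighborhood is a finite set of states that the true state could be given the received observation, and all names and numbers below (minimax_regret_action, the Q table, the state and action labels) are hypothetical and chosen only to mirror the stop-sign example.

```python
from typing import Callable, Hashable, Sequence


def minimax_regret_action(
    q: Callable[[Hashable, Hashable], float],
    neighborhood: Sequence[Hashable],
    actions: Sequence[Hashable],
) -> Hashable:
    """Pick the action whose worst-case regret over the neighborhood is smallest.

    Regret of `action` in `state` is the gap between the best achievable
    value in that state and the value of `action` in that state.
    """

    def regret(state: Hashable, action: Hashable) -> float:
        best_value = max(q(state, a) for a in actions)
        return best_value - q(state, action)

    def worst_case_regret(action: Hashable) -> float:
        return max(regret(state, action) for state in neighborhood)

    return min(actions, key=worst_case_regret)


# Illustrative (made-up) values, not results from the paper.
Q_TABLE = {
    ("stop_sign", "brake"): 1.0,
    ("stop_sign", "cruise"): -10.0,
    ("speed_limit", "brake"): 0.5,
    ("speed_limit", "cruise"): 1.0,
}
ACTIONS = ["brake", "cruise"]


def q(state, action):
    return Q_TABLE[(state, action)]


# The agent receives "speed_limit", but the true state might be "stop_sign".
observed = "speed_limit"
neighborhood = ["speed_limit", "stop_sign"]

# Acting greedily on the received observation alone.
greedy_action = max(ACTIONS, key=lambda a: q(observed, a))

# Acting to minimize the maximum regret over the neighborhood.
robust_action = minimax_regret_action(q, neighborhood, ACTIONS)

print(greedy_action, robust_action)
```

In this toy setting, the greedy rule trusts the perturbed observation and cruises, while the regret-based rule brakes because cruising would incur a large regret if the sign were actually a stop sign. How the neighborhood is defined and how regret is estimated during training are left to the paper itself.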

arXiv information

Authors	Roman Belaire, Pradeep Varakantham, David Lo
Published	2023-02-15 02:21:17+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
