Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning

Summary

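In real-world decision-making settings such as finance, robotics, and autonomous driving, keeping risk under control is often more important than maximizing expected reward. The most natural risk measure is variance, but it penalizes upside volatility as much as the downside part. The (downside) semivariance, which captures only the negative deviation of a random variable below its mean, is better suited to risk-averse purposes. This paper aims to optimize the mean-semivariance (MSV) criterion in reinforcement learning with respect to the steady reward distribution. Because semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods cannot be applied directly to MSV problems. To tackle this challenge, we draw on perturbation analysis (PA) theory and establish a performance difference formula for MSV. We show that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. We further propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we run experiments ranging from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of the proposed methods.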
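
For readers who want the criterion in symbols, the following is one common way to write it. The notation is an assumption made here for illustration (the symbols $R$, $\eta_\pi$, $\zeta_\pi$, and the risk-aversion weight $\beta$ are not taken from this page), so it may differ from the paper's own definitions.

```latex
% Assumed notation: R is the reward under the steady reward distribution
% induced by policy \pi, and \beta \ge 0 is a hypothetical risk-aversion weight.
\begin{align}
  \eta_\pi  &= \mathbb{E}_\pi[R]
      && \text{(mean reward)} \\
  \zeta_\pi &= \mathbb{E}_\pi\!\left[\bigl(\min\{R - \eta_\pi,\, 0\}\bigr)^2\right]
      && \text{(downside semivariance)} \\
  \max_\pi \;& \eta_\pi - \beta\, \zeta_\pi
      && \text{(mean-semivariance objective)}
\end{align}
% Variance splits into downside and upside parts:
% \mathbb{E}_\pi[(R-\eta_\pi)^2]
%   = \zeta_\pi + \mathbb{E}_\pi[(\max\{R-\eta_\pi, 0\})^2],
% so penalizing only \zeta_\pi leaves above-mean outcomes unpenalized.
```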

Summary (original)

Keeping risk under control is often more crucial than maximizing expected rewards in real-world decision-making situations such as finance, robotics, and autonomous driving. The most natural choice of risk measure is variance, which penalizes the upside volatility as much as the downside part. Instead, the (downside) semivariance, which captures the negative deviation of a random variable below its mean, is more suitable for risk-averse purposes. This paper aims to optimize the mean-semivariance (MSV) criterion in reinforcement learning w.r.t. the steady reward distribution. Since semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods are not directly applicable to MSV problems. To tackle this challenge, we resort to Perturbation Analysis (PA) theory and establish the performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we conduct diverse experiments, from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods.
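
The difference between variance and downside semivariance can be seen on a small numerical example. The sketch below is not the paper's algorithm: it only evaluates the two risk measures on fixed reward samples, and the function names, the sample values, and the weight `beta` are hypothetical choices made for illustration.

```python
import numpy as np


def downside_semivariance(rewards):
    """Mean squared shortfall of the rewards below their own sample mean."""
    rewards = np.asarray(rewards, dtype=float)
    # Keep only negative deviations from the mean; positive ones become 0.
    shortfall = np.minimum(rewards - rewards.mean(), 0.0)
    return float(np.mean(shortfall ** 2))


def msv_score(rewards, beta=1.0):
    """Sample mean minus beta times downside semivariance (beta is assumed)."""
    rewards = np.asarray(rewards, dtype=float)
    return float(rewards.mean() - beta * downside_semivariance(rewards))


# Two hypothetical reward samples with the same mean and the same variance.
upside = np.array([0.0, 0.0, 0.0, 4.0])     # occasional large gain
downside = np.array([2.0, 2.0, 2.0, -2.0])  # occasional large loss

for name, sample in [("upside", upside), ("downside", downside)]:
    print(name, sample.mean(), np.var(sample),
          downside_semivariance(sample), msv_score(sample))
```

Both samples have mean 1 and variance 3, so a mean-variance criterion cannot tell them apart. Their downside semivariances are 0.75 and 2.25 respectively, so the mean-semivariance score prefers the sample whose volatility comes from gains rather than losses.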

arXiv information

Authors	Xiaoteng Ma, Shuai Ma, Li Xia, Qianchuan Zhao
Published	2023-03-08 09:47:11+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
