On the Global Convergence of Risk-Averse Policy Gradient Methods with Expected Conditional Risk Measures

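Summary

Risk-sensitive reinforcement learning (RL) has become a common tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems.
Policy gradient (PG) methods have been developed for risk-sensitive RL, but it remains unclear whether they enjoy the same global convergence guarantees as in the risk-neutral case \citep{mei2020global,agarwal2021theory,cen2022fast,bhandari2024global}.
This paper considers a class of dynamic, time-consistent risk measures called Expected Conditional Risk Measures (ECRMs) and derives PG and natural policy gradient (NPG) updates for ECRM-based RL problems.
It establishes global optimality and iteration complexities of the proposed algorithms under four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation.
In addition, a risk-averse REINFORCE algorithm \citep{williams1992simple} and a risk-averse NPG algorithm \citep{kakade2001natural} are tested on a stochastic Cliffwalk environment to demonstrate the efficacy of the methods and the importance of risk control.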

Abstract (original)

Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While Policy Gradient (PG) methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case \citep{mei2020global,agarwal2021theory,cen2022fast,bhandari2024global}. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive PG and Natural Policy Gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexities of the proposed algorithms under the following four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation. Furthermore, we test a risk-averse REINFORCE algorithm \citep{williams1992simple} and a risk-averse NPG algorithm \citep{kakade2001natural} on a stochastic Cliffwalk environment to demonstrate the efficacy of our methods and the importance of risk control.

arXiv information

Authors	Xian Yu, Lei Ying
Published	2024-11-05 14:31:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
