A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP

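Abstract

Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted-reward Markov decision process (MDP).
Specifically, we analyze a temporal difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation.
We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization.
Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of $O(1/t)$ after $t$ iterations.
Moreover, for the regularized TD variant, our bound holds for a universal step size.
Next, we integrate a simultaneous perturbation stochastic approximation (SPSA)-based actor update with an LFA critic and establish an $O(n^{-1/4})$ convergence guarantee.
These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning, with a focus on variance as a risk measure.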

Abstract (original)

Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted reward Markov Decision Process (MDP). Specifically, we analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization. Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of $O(1/t)$ after $t$ iterations. Moreover, for the regularized TD variant, our bound holds for a universal step size. Next, we integrate a Simultaneous Perturbation Stochastic Approximation (SPSA)-based actor update with an LFA critic and establish an $O(n^{-1/4})$ convergence guarantee, where $n$ denotes the iterations of the SPSA-based actor-critic algorithm. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning, with a focus on variance as a risk measure.
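
As a rough illustration of the policy-evaluation step described above, the following Python sketch runs TD(0) with linear function approximation on a toy two-state Markov reward process and returns the average of the last half of the iterates (tail iterate averaging). The transition matrix, rewards, discount factor, step size, and the function name td0_tail_average are all hypothetical choices made for this example. It estimates only the ordinary value function and does not reproduce the paper's mean-variance critic, its regularized variant, or the SPSA-based actor.

```python
import numpy as np


def td0_tail_average(P, r, Phi, gamma, step, num_iters, tail_frac=0.5, seed=0):
    """TD(0) with linear features Phi, returning the tail-averaged parameter.

    Illustrative sketch only: names and defaults are assumptions,
    not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_states, d = Phi.shape
    theta = np.zeros(d)
    tail_start = int((1.0 - tail_frac) * num_iters)  # first index in the tail
    tail_sum = np.zeros(d)
    s = 0
    for t in range(num_iters):
        # Sample the next state from the Markov chain induced by the policy.
        s_next = rng.choice(n_states, p=P[s])
        # TD error under the linear approximation V(s) = Phi[s] @ theta.
        td_error = r[s] + gamma * (Phi[s_next] @ theta) - Phi[s] @ theta
        theta = theta + step * td_error * Phi[s]
        if t >= tail_start:
            tail_sum += theta
        s = s_next
    return tail_sum / (num_iters - tail_start)


# Toy problem (all values are assumptions for illustration).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # transition probabilities
r = np.array([1.0, 0.0])        # reward received in each state
Phi = np.eye(2)                 # tabular features, so the TD fixed point is exact
gamma = 0.5

theta_bar = td0_tail_average(P, r, Phi, gamma, step=0.05, num_iters=20000)

# Exact value function for comparison: V = (I - gamma * P)^{-1} r.
v_true = np.linalg.solve(np.eye(2) - gamma * P, r)
print("tail-averaged estimate:", theta_bar)
print("exact values:          ", v_true)
```

With tabular features the TD fixed point coincides with the exact value function, so the two printed vectors can be compared directly. With a genuinely low-dimensional feature matrix the iterates would instead approach the projected fixed point.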

arXiv information

Authors	Tejaram Sangadi, L. A. Prashanth, Krishna Jagannathan
Published	2025-03-12 14:32:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
