Robust Average-Reward Markov Decision Processes

Summary

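Robust Markov decision processes (MDPs) handle uncertainty in the transition kernel by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs.
Much of the literature focuses on discounted MDPs, while robust average-reward MDPs remain largely unexplored.
This paper focuses on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over the uncertainty set.
The authors first approximate average-reward MDPs using discounted MDPs.
They prove that the robust discounted value function converges to the robust average reward as the discount factor $\gamma$ goes to $1$, and that, when $\gamma$ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward MDP.
They further design a robust dynamic programming approach and theoretically characterize its convergence to the optimum.
Next, they study robust average-reward MDPs directly, without using discounted MDPs as an intermediate step.
They derive the robust Bellman equation for robust average-reward MDPs, prove that an optimal policy can be derived from its solution, and design a robust relative value iteration algorithm that provably finds that solution or, equivalently, the optimal robust policy.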
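
The summary names two objects without writing them down: the limit that links the discounted and average-reward settings, and the robust Bellman equation. The block below is a sketch in commonly used notation, not a quotation from the paper. It assumes an $(s,a)$-rectangular uncertainty set $\mathcal{P} = \bigotimes_{s,a} \mathcal{P}_s^a$ and a reward function $r(s,a)$; the symbols $\sigma$, $g$, $V$ and $V_\gamma^\pi$ are notational assumptions introduced here, and the $(1-\gamma)$ normalization in the limit is the usual way such a convergence statement is made precise.

```latex
% Worst-case expected next-state value over the uncertainty set at (s, a)
\sigma_{\mathcal{P}_s^a}(V) = \min_{p \in \mathcal{P}_s^a} p^\top V

% Discounted-to-average link (with the usual (1 - \gamma) normalization)
\lim_{\gamma \to 1} (1 - \gamma)\, V_\gamma^\pi(s) = g^\pi

% Robust Bellman equation for the average-reward setting:
% g is the optimal worst-case average reward, V a relative value function
V(s) + g = \max_{a} \Big\{ r(s,a) + \sigma_{\mathcal{P}_s^a}(V) \Big\}
```

In this form, a policy that picks a maximizing action in each state is the candidate optimal robust policy, which matches the summary's claim that the optimal policy can be read off from the equation's solution.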

Summary (original)

In robust Markov decision processes (MDPs), the uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average-reward as the discount factor $\gamma$ goes to $1$, and moreover, when $\gamma$ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward. We further design a robust dynamic programming approach, and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.
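
To make the two approaches in the abstract concrete, here is a minimal sketch of robust relative value iteration and of the discounted approximation on a toy problem. It is not the authors' implementation. It assumes a finite, $(s,a)$-rectangular uncertainty set (a short list of candidate next-state distributions per state-action pair), and every name and number in it (`REWARDS`, `KERNELS`, `worst_case`, `robust_backup`, the two-state example) is hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a tiny robust MDP with a finite,
# (s, a)-rectangular uncertainty set. All numbers are made up.

# REWARDS[s][a]: reward for taking action a in state s.
REWARDS = [[1.0, 0.2],
           [0.0, 0.5]]

# KERNELS[s][a]: candidate next-state distributions for the pair (s, a).
KERNELS = [
    [[[0.9, 0.1], [0.6, 0.4]], [[0.5, 0.5], [0.3, 0.7]]],
    [[[0.7, 0.3], [0.4, 0.6]], [[0.2, 0.8], [0.1, 0.9]]],
]


def worst_case(candidates, values):
    """Smallest expected next-state value over the candidate distributions."""
    return min(sum(p * v for p, v in zip(dist, values)) for dist in candidates)


def robust_backup(values, discount=1.0):
    """Apply max_a [ r(s,a) + discount * min_p p.V ] to every state."""
    backed_up, greedy = [], []
    for s, action_rewards in enumerate(REWARDS):
        q = [r + discount * worst_case(KERNELS[s][a], values)
             for a, r in enumerate(action_rewards)]
        best = max(range(len(q)), key=q.__getitem__)
        backed_up.append(q[best])
        greedy.append(best)
    return backed_up, greedy


def robust_relative_value_iteration(ref_state=0, tol=1e-10, max_iters=10_000):
    """Iterate w <- T(w) - T(w)[ref_state] until w stops changing."""
    w = [0.0] * len(REWARDS)
    for _ in range(max_iters):
        backed_up, greedy = robust_backup(w)
        gain = backed_up[ref_state]  # estimate of the worst-case average reward
        new_w = [v - gain for v in backed_up]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return gain, w, greedy


def robust_discounted_value_iteration(discount, iters=5_000):
    """Plain robust value iteration for the discounted problem."""
    values = [0.0] * len(REWARDS)
    for _ in range(iters):
        values, _ = robust_backup(values, discount)
    return values


gain, relative_values, policy = robust_relative_value_iteration()
# Scaling the discounted values by (1 - gamma) should land near the gain.
scaled = [(1 - 0.99) * v for v in robust_discounted_value_iteration(0.99)]
```

Every candidate distribution in the toy example has strictly positive entries, which keeps the induced chains aperiodic and makes the iteration well behaved; that is a property of this example, not a statement of the paper's assumptions. Subtracting the value at a reference state on each sweep keeps the iterates bounded, which is the usual reason to prefer relative value iteration over plain undiscounted value iteration.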

arXiv information

Authors	Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, Shaofeng Zou
Published	2023-03-01 15:22:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
