Towards Minimax Optimality of Model-based Robust Reinforcement Learning

Summary

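We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{robust} discounted Markov decision processes (RMDPs), given access only to a generative model of the nominal kernel.
This problem has been widely studied in the non-robust case: any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples is known to yield an $\epsilon$-optimal policy, which is minimax optimal.
Results in the robust case are much scarcer.
For $sa$- (resp. $s$-) rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), and it holds only for specific algorithms and for uncertainty sets based on the total variation (TV), KL, or chi-square divergences.
In this paper, we consider uncertainty sets defined with an $L_p$-ball (which recovers the TV case) and study the sample complexity of \emph{any} planning algorithm (with a high-accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model.
In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements by factors of $\mid S \mid$ and $\mid S \mid\mid A \mid$, respectively).
When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$, recovering for the first time the lower bound for the non-robust case, as well as a robust lower bound in that small-uncertainty regime.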

Abstract (original)

We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $\mid S \mid$ and $\mid S \mid\mid A \mid$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound when the size of the uncertainty is small enough.
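
To make the stated improvements concrete, the following is a minimal sketch that evaluates only the leading-order terms of the bounds quoted in the abstract. It drops all constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation, so the numbers are useful for comparing rates, not as actual sample counts. The function name `leading_terms`, its parameter names, and the dictionary keys are hypothetical, and the identification $H = 1/(1-\gamma)$ is an assumption (a common convention for discounted MDPs) that the abstract itself does not state.

```python
def leading_terms(num_states, num_actions, gamma, epsilon):
    """Leading-order terms of the bounds quoted in the abstract.

    Constants and log factors are dropped, so only ratios between
    entries are meaningful.
    """
    # Assumption: effective horizon H = 1 / (1 - gamma); not stated in the abstract.
    horizon = 1.0 / (1.0 - gamma)
    s, a = num_states, num_actions
    eps2 = epsilon ** 2
    return {
        # Non-robust case: H^3 |S||A| / eps^2
        "non_robust": horizon**3 * s * a / eps2,
        # Best previously known robust bounds: H^4 |S|^2 |A| / eps^2 (sa-rectangular)
        "prior_sa_rectangular": horizon**4 * s**2 * a / eps2,
        # ... and H^4 |S|^2 |A|^2 / eps^2 (s-rectangular)
        "prior_s_rectangular": horizon**4 * s**2 * a**2 / eps2,
        # Bound claimed in the paper for both rectangular cases: H^4 |S||A| / eps^2
        "paper_general": horizon**4 * s * a / eps2,
        # Bound claimed when the uncertainty size is small enough: H^3 |S||A| / eps^2
        "paper_small_uncertainty": horizon**3 * s * a / eps2,
    }


if __name__ == "__main__":
    terms = leading_terms(num_states=10, num_actions=5, gamma=0.9, epsilon=0.1)
    for name, value in terms.items():
        print(f"{name}: {value:.3e}")
```

Under these assumptions, the ratio of the prior $sa$-rectangular term to the general term is $\mid S \mid$, the ratio of the prior $s$-rectangular term to the general term is $\mid S \mid\mid A \mid$, and the small-uncertainty term coincides with the non-robust one, which matches the improvements claimed in the abstract.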

arXiv information

Authors	Pierre Clavier, Erwan Le Pennec, Matthieu Geist
Published	2023-10-17 16:56:55+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
