Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity

Summary

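Reinforcement learning algorithms often require the state and action spaces of a Markov decision process (MDP), also called a controlled Markov chain, to be finite, and the literature contains various efforts to make such algorithms applicable to continuous state and action spaces.
In this paper, we show that under very mild regularity conditions (in particular, conditions involving only weak continuity of the MDP's transition kernel), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit.
Furthermore, this limit satisfies an optimality equation that yields near optimality, either with explicit performance bounds or with guaranteed asymptotic optimality.
Our approach builds on (i) viewing quantization as a measurement kernel, and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) using near-optimality and convergence results for Q-learning in POMDPs, and (iii) the near optimality of finite-state model approximations for MDPs with weakly continuous kernels, which we show to correspond to the fixed point of the constructed POMDP.
Our paper thus presents a very general convergence and approximation result for the applicability of Q-learning to continuous MDPs.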

Summary (original)

Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs) (also called controlled Markov chains) and various efforts have been made in the literature towards the applicability of such algorithms for continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality with either explicit performance bounds or which are guaranteed to be asymptotically optimal. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) utilizing near optimality and convergence results of Q-learning for POMDPs, and (iii) finally, near-optimality of finite state model approximations for MDPs with weakly continuous kernels which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs.
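As a rough illustration of the idea of Q-learning through quantization, the following minimal sketch runs tabular Q-learning on the bin index of a continuous state rather than on the state itself. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy dynamics on [0, 1], the quadratic cost, the uniform grids, the uniformly random exploration, the cost-minimization convention, and the names `quantize` and `quantized_q_learning`.

```python
import numpy as np


def quantize(x, n_bins):
    """Map x in [0, 1] to the index of a uniform bin (the quantized 'measurement')."""
    return min(int(x * n_bins), n_bins - 1)


def quantized_q_learning(n_bins=10, n_actions=5, beta=0.9, steps=50_000, seed=0):
    """Tabular Q-learning on quantized states and actions of a toy continuous MDP.

    Hypothetical setup: state x in [0, 1], action u in [-1, 1],
    next state clip(x + 0.1*u + noise), stage cost (x - 0.5)^2 + 0.01*u^2.
    """
    rng = np.random.default_rng(seed)
    actions = np.linspace(-1.0, 1.0, n_actions)  # finite (quantized) action set
    q = np.zeros((n_bins, n_actions))            # Q-table over (bin, action index)
    visits = np.zeros((n_bins, n_actions))       # visit counts for the step size
    x = rng.uniform()
    for _ in range(steps):
        i = quantize(x, n_bins)                  # the learner only sees the bin
        a = rng.integers(n_actions)              # uniformly random exploration
        u = actions[a]
        cost = (x - 0.5) ** 2 + 0.01 * u ** 2
        noise = 0.05 * rng.standard_normal()
        x_next = float(np.clip(x + 0.1 * u + noise, 0.0, 1.0))
        j = quantize(x_next, n_bins)
        visits[i, a] += 1
        alpha = 1.0 / visits[i, a]               # decreasing, visit-based step size
        # Standard Q-learning update for discounted cost minimization.
        q[i, a] += alpha * (cost + beta * q[j].min() - q[i, a])
        x = x_next
    return q, actions


if __name__ == "__main__":
    q, actions = quantized_q_learning()
    greedy = actions[q.argmin(axis=1)]           # greedy action for each state bin
    print(greedy)
```

The learner never uses the exact state in its update, only the bin index, which is what makes the quantized model look like a partially observed problem. Refining the grids (larger `n_bins` and `n_actions`) is the knob this sketch exposes for trading table size against approximation quality.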

arXiv information

Authors	Ali Devran Kara, Naci Saldi, Serdar Yüksel
Published	2023-09-07 17:42:54+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
