Optimistic Q-learning for average reward and episodic reinforcement learning

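Summary

We present an optimistic Q-learning algorithm for regret minimization in average-reward reinforcement learning, under the additional assumption on the underlying MDP that, for every policy, the time to visit some frequent state $s_0$ is upper bounded by $H$, either in expectation or with constant probability.
This setting strictly generalizes the episodic setting and is far less restrictive than the assumption of bounded hitting time \textit{for all states} made in most prior work on model-free algorithms for the average-reward setting.
We show a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions and $T$ is the horizon.
The main technical novelty of our work is the introduction of the $\overline{L}$ operator.
Under the given assumption, we show that the $\overline{L}$ operator is a strict contraction (in span) even in the average-reward setting, where the discount factor is $1$.
Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator.
We thus provide a unified view of regret minimization in episodic and non-episodic settings.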

Abstract (original)

We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time \textit{for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.
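
The definition $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ can be illustrated with a small numerical sketch. The code below is not the paper's algorithm: it only evaluates the undiscounted Bellman operator $L$ and the averaged operator $\overline{L}$ on a hypothetical toy MDP (three states on a deterministic cycle, two actions that differ only in reward), chosen here purely for illustration. In that toy MDP every policy returns to state 0 within $H = 3$ steps. The names `bellman`, `l_bar`, `span`, and all numeric values are assumptions made for this example.

```python
def bellman(v, rewards, trans):
    """Undiscounted Bellman operator: (Lv)(s) = max_a [r(s,a) + sum_s' P(s'|s,a) v(s')]."""
    return [
        max(
            rewards[s][a] + sum(p * v[t] for t, p in enumerate(trans[s][a]))
            for a in range(len(rewards[s]))
        )
        for s in range(len(v))
    ]


def l_bar(v, rewards, trans, H):
    """Averaged operator: (1/H) * sum_{h=1}^{H} L^h v."""
    total = [0.0] * len(v)
    current = list(v)
    for _ in range(H):
        current = bellman(current, rewards, trans)  # current now holds L^h v
        total = [t + c for t, c in zip(total, current)]
    return [t / H for t in total]


def span(v):
    """Span seminorm: max(v) - min(v)."""
    return max(v) - min(v)


def diff(u, v):
    """Componentwise difference u - v."""
    return [a - b for a, b in zip(u, v)]


# Hypothetical toy MDP (illustration only): deterministic cycle 0 -> 1 -> 2 -> 0.
# Both actions follow the cycle; they differ only in their rewards.
next_state = [1, 2, 0]
rewards = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.0]]
trans = [
    [[1.0 if t == next_state[s] else 0.0 for t in range(3)] for _ in range(2)]
    for s in range(3)
]

H = 3  # state 0 is reached within 3 steps from every state
u = [3.0, 0.0, 1.0]
v = [0.0, 0.0, 0.0]

span_before = span(diff(u, v))
span_after_L = span(diff(bellman(u, rewards, trans), bellman(v, rewards, trans)))
span_after_L_bar = span(diff(l_bar(u, rewards, trans, H), l_bar(v, rewards, trans, H)))

print("span(u - v)           =", span_before)
print("span(Lu - Lv)         =", span_after_L)
print("span(L_bar u - L_bar v) =", span_after_L_bar)
```

In this toy cycle a single application of $L$ only permutes the entries of $u - v$, so the span does not shrink, whereas averaging $L^1, L^2, L^3$ replaces each entry by the mean of $u - v$ and the span collapses to (numerically) zero. This is a deliberately degenerate case; it says nothing about the contraction factor obtained in the paper's general setting.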

arXiv information

Authors	Priyank Agrawal, Shipra Agrawal
Published	2025-03-24 16:42:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
