The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning

Summary

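This paper concerns the $d$-dimensional stochastic approximation recursion $$ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) $$ where $\{ \Phi_n \}$ is a stochastic process on a general state space, satisfying a conditional Markov property that allows for parameter-dependent noise.
The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3):
(i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$.
(ii) A functional central limit theorem (CLT) is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply that the normalized covariance $\textsf{E} [ z_n z_n^T ]$ converges to the asymptotic covariance in the CLT, where $z_n{=:} (\theta_n-\theta^*)/\sqrt{\alpha_n}$.
(iii) The CLT also holds for $z^{\text{PR}}_n{=:} \sqrt{n} [\theta^{\text{PR}}_n -\theta^*]$, the normalized version of the averaged parameters $\theta^{\text{PR}}_n {=:} n^{-1} \sum_{k=1}^n\theta_k$, subject to standard assumptions on the step-size. Moreover, the covariance in this CLT coincides with the minimal covariance of Polyak and Ruppert.
(iv) An example is given in which $f$ and $\bar{f}$ are linear in $\theta$, and $\Phi$ is a geometrically ergodic Markov chain that does not satisfy (DV3). The algorithm converges, yet the second moment of $\theta_n$ is unbounded and in fact diverges.
This arXiv version 3 is a major extension of the results in prior versions: the main results now allow for parameter-dependent noise, as is often the case in applications to reinforcement learning.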
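
The recursion and the Polyak-Ruppert average described above can be illustrated with a minimal scalar sketch. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the two-state Markov chain, the function `g`, the vector field $f(\theta, \Phi) = -\theta + g(\Phi)$, and the step-size $\alpha_n = n^{-\rho}$ with $\rho = 0.7$. For this toy model the mean vector field is $\bar{f}(\theta) = 1 - \theta$, so $\theta^* = 1$. The noise here does not depend on the parameter, so the sketch does not exercise the parameter-dependent setting of the main results.

```python
import random


def run_sa(n_steps, rho=0.7, stay_prob=0.8, seed=0):
    """Scalar stochastic approximation with Markovian noise (toy model).

    Hypothetical setup, not from the paper:
      f(theta, phi) = -theta + g(phi), with g = (0.0, 2.0),
      Phi is a symmetric two-state Markov chain (uniform stationary law),
      so the mean vector field is f_bar(theta) = 1 - theta and theta* = 1.
    Returns the last iterate theta_n and the averaged estimate theta_PR.
    """
    rng = random.Random(seed)
    g = (0.0, 2.0)        # illustrative noise function g(phi)
    phi = 0               # current state of the Markov chain
    theta = 0.0           # initial parameter estimate
    running_sum = 0.0     # accumulates theta_1 + ... + theta_n

    for n in range(1, n_steps + 1):
        # One Markov transition: flip the state with probability 1 - stay_prob.
        if rng.random() > stay_prob:
            phi = 1 - phi
        alpha = n ** (-rho)                 # step-size alpha_n = n^{-rho}
        theta += alpha * (-theta + g[phi])  # theta_n = theta_{n-1} + alpha_n f(theta_{n-1}, Phi_n)
        running_sum += theta

    theta_pr = running_sum / n_steps        # average of theta_1, ..., theta_n
    return theta, theta_pr


if __name__ == "__main__":
    theta, theta_pr = run_sa(100_000)
    print("last iterate:", theta)
    print("Polyak-Ruppert average:", theta_pr)
```

Running the sketch with several seeds and comparing the spread of the last iterate with that of the average gives a rough empirical feel for the two normalizations, $\sqrt{\alpha_n}$ for $\theta_n$ and $1/\sqrt{n}$ for $\theta^{\text{PR}}_n$, that appear in items (ii) and (iii).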

Abstract (original)

The paper concerns the $d$-dimensional stochastic approximation recursion, $$ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) $$ where $ \{ \Phi_n \}$ is a stochastic process on a general state space, satisfying a conditional Markov property that allows for parameter-dependent noise. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3): (i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$. (ii) A functional central limit theorem (CLT) is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\textsf{E} [ z_n z_n^T ]$ to the asymptotic covariance in the CLT, where $z_n{=:} (\theta_n-\theta^*)/\sqrt{\alpha_n}$. (iii) The CLT holds for the normalized version $z^{\text{PR}}_n{=:} \sqrt{n} [\theta^{\text{PR}}_n -\theta^*]$, of the averaged parameters $\theta^{\text{PR}}_n {=:} n^{-1} \sum_{k=1}^n\theta_k$, subject to standard assumptions on the step-size. Moreover, the covariance in the CLT coincides with the minimal covariance of Polyak and Ruppert. (iv) An example is given where $f$ and $\bar{f}$ are linear in $\theta$, and $\Phi$ is a geometrically ergodic Markov chain but does not satisfy (DV3). While the algorithm is convergent, the second moment of $\theta_n$ is unbounded and in fact diverges. This arXiv version 3 represents a major extension of the results in prior versions. The main results now allow for parameter-dependent noise, as is often the case in applications to reinforcement learning.

arXiv information

Authors	Vivek Borkar, Shuhang Chen, Adithya Devraj, Ioannis Kontoyiannis, Sean Meyn
Published	2024-11-07 15:59:59+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
