Statistical guarantees for continuous-time policy evaluation: blessing of ellipticity and new tradeoffs


We study the estimation of the value function for continuous-time Markov diffusion processes using a single, discretely observed ergodic trajectory. Our work provides non-asymptotic statistical guarantees for the least-squares temporal-difference (LSTD) method, with performance measured in the first-order Sobolev norm. Specifically, the estimator attains an $O(1 / \sqrt{T})$ convergence rate when using a trajectory of length $T$; notably, this rate is achieved as long as $T$ scales nearly linearly with both the mixing time of the diffusion and the number of basis functions employed. A key insight of our approach is that the ellipticity inherent in the diffusion process ensures robust performance even as the effective horizon diverges to infinity. Moreover, we demonstrate that the Markovian component of the statistical error can be controlled by the approximation error, while the martingale component grows at a slower rate relative to the number of basis functions. By carefully balancing these two sources of error, our analysis reveals novel trade-offs between approximation and statistical errors.
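To make the setting concrete, the following is a minimal sketch of an LSTD-style estimator on a single discretely observed trajectory. It is not the paper's exact estimator or analysis: the Ornstein-Uhlenbeck dynamics, the reward $r(x) = x$, the discount rate `beta`, the step size `eta`, the polynomial basis, and all function names are assumptions chosen for illustration. For this toy problem the discounted value function is $V(x) = x / (1 + \beta)$, so the fitted coefficient on $x$ should be close to $1 / (1 + \beta)$, up to discretization and statistical error.

```python
import numpy as np


def simulate_ou(n_steps, eta, sigma=1.0, x0=0.0, seed=0):
    """Sample dX = -X dt + sigma dW exactly on a grid with spacing eta."""
    rng = np.random.default_rng(seed)
    decay = np.exp(-eta)
    # Exact one-step conditional standard deviation of the OU transition.
    noise_sd = sigma * np.sqrt((1.0 - decay**2) / 2.0)
    z = rng.standard_normal(n_steps)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = decay * x[k] + noise_sd * z[k]
    return x


def features(x):
    """Hypothetical basis (1, x, x^2); any finite basis could be used here."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)


def lstd(x, reward, beta, eta):
    """Solve the empirical projected Bellman equation A theta = b.

    Uses the one-step approximation V(x) ~ eta * r(x) + exp(-beta * eta) * E[V(X_eta) | X_0 = x].
    """
    phi = features(x)
    phi_now, phi_next = phi[:-1], phi[1:]
    gamma = np.exp(-beta * eta)  # per-step discount factor
    n = len(phi_now)
    A = phi_now.T @ (phi_now - gamma * phi_next) / n
    b = phi_now.T @ (eta * reward(x[:-1])) / n
    return np.linalg.solve(A, b)


# Illustrative parameters (assumed, not taken from the paper).
eta, beta, T = 0.01, 1.0, 4000.0
path = simulate_ou(int(round(T / eta)), eta)
theta = lstd(path, lambda x: x, beta, eta)
print(theta)  # coefficients on (1, x, x^2)
```

Increasing `T` in this sketch shrinks the statistical error, while enlarging the basis reduces approximation error at the cost of more coefficients to estimate, which is the kind of trade-off the abstract refers to.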


Author: Wenlong Mou
Published: 2025-02-06 18:39:03+00:00
Categories: cs.LG, math.OC, math.PR, math.ST, stat.TH