Stabilizing Temporal Difference Learning via Implicit Stochastic Approximation

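Abstract

Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse of applied RL and as a building block for more complex and specialized algorithms. Despite its widespread use, however, it is not without drawbacks. A poor choice of step size can dramatically inflate the error of the value estimates and slow convergence. As a result, in practice, researchers must resort to trial and error to identify a suitable step size. As an alternative, we propose implicit TD algorithms that reformulate TD updates as fixed-point equations. These updates are more stable and less sensitive to step size, without sacrificing computational efficiency. In addition, our theoretical analysis establishes asymptotic convergence guarantees and finite-time error bounds. Our results demonstrate robustness and practicality on modern RL tasks, establishing implicit TD as a versatile tool for policy evaluation and value approximation.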

Abstract (original)

Temporal Difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse for applied RL as well as a building block for more complex and specialized algorithms. However, despite its widespread use, it is not without drawbacks, the most prominent being its sensitivity to step size. A poor choice of step size can dramatically inflate the error of value estimates and slow convergence. Consequently, in practice, researchers must use trial and error in order to identify a suitable step size — a process that can be tedious and time consuming. As an alternative, we propose implicit TD algorithms that reformulate TD updates into fixed-point equations. These updates are more stable and less sensitive to step size without sacrificing computational efficiency. Moreover, our theoretical analysis establishes asymptotic convergence guarantees and finite-time error bounds. Our results demonstrate their robustness and practicality for modern RL tasks, establishing implicit TD as a versatile tool for policy evaluation and value approximation.
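The abstract does not spell out the update rule, so the following is only a minimal sketch under stated assumptions: linear value features, and an implicit TD(0) step in which the current state's value is evaluated at the new parameter vector. Under that assumed form the fixed-point equation has a closed-form solution that shrinks the step size from alpha to alpha / (1 + alpha * ||phi||^2). This may differ from the authors' exact formulation, and all function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td0_step(theta, phi, phi_next, reward, gamma, alpha):
    """Standard TD(0) step with linear features."""
    delta = reward + gamma * dot(phi_next, theta) - dot(phi, theta)
    return [t + alpha * delta * p for t, p in zip(theta, phi)]


def implicit_td0_step(theta, phi, phi_next, reward, gamma, alpha):
    """Implicit TD(0) step (assumed form, for illustration only).

    Solves the fixed-point equation
        theta_new = theta + alpha * (r + gamma * phi_next.theta - phi.theta_new) * phi
    whose solution is a TD(0) step with the shrunken step size
        alpha / (1 + alpha * ||phi||^2).
    """
    delta = reward + gamma * dot(phi_next, theta) - dot(phi, theta)
    effective_alpha = alpha / (1.0 + alpha * dot(phi, phi))
    return [t + effective_alpha * delta * p for t, p in zip(theta, phi)]


def compare(steps=20, alpha=1.0):
    """Run both updates on a toy one-step episode with a deliberately large alpha.

    Toy problem (hypothetical): one state with feature [2.0], reward 1.0,
    then termination (zero next feature). The exact weight is 0.5.
    """
    phi, phi_next, reward, gamma = [2.0], [0.0], 1.0, 0.9
    theta_td, theta_imp = [0.0], [0.0]
    for _ in range(steps):
        theta_td = td0_step(theta_td, phi, phi_next, reward, gamma, alpha)
        theta_imp = implicit_td0_step(theta_imp, phi, phi_next, reward, gamma, alpha)
    return theta_td, theta_imp


if __name__ == "__main__":
    theta_td, theta_imp = compare()
    print("standard TD(0):", theta_td)   # grows without bound for this alpha
    print("implicit TD(0):", theta_imp)  # approaches 0.5
```

In this toy setting the standard update multiplies the error by -3 at every step, while the implicit update multiplies it by 0.2, which illustrates the kind of step-size robustness the abstract describes.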

arXiv information

Authors	Hwanwoo Kim, Panos Toulis, Eric Laber
Published	2025-05-02 15:57:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
