Multi-State TD Target for Model-Free Reinforcement Learning

Abstract

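Temporal difference (TD) learning is a fundamental reinforcement learning technique that updates value estimates for states or state-action pairs using a TD target.
This target is an improved estimate of the true value, combining the immediate reward with the estimated value of the subsequent state.
Traditionally, TD learning relies on the value of a single subsequent state.
We propose an enhanced multi-state TD (MSTD) target that uses the estimated values of multiple subsequent states.
Building on this MSTD concept, we develop complete actor-critic algorithms, including replay buffer management in two modes, and integrate them with deep deterministic policy gradient (DDPG) and soft actor-critic (SAC).
Experimental results show that algorithms using the MSTD target significantly improve learning performance over traditional methods. The code is provided on GitHub.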

Abstract (original)

Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs using a TD target. This target represents an improved estimate of the true value by incorporating both immediate rewards and the estimated value of subsequent states. Traditionally, TD learning relies on the value of a single subsequent state. We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states. Building on this new MSTD concept, we develop complete actor-critic algorithms that include management of replay buffers in two modes, and integrate with deep deterministic policy optimization (DDPG) and soft actor-critic (SAC). Experimental results demonstrate that algorithms employing the MSTD target significantly improve learning performance compared to traditional methods. The code is provided on GitHub.
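
To make the idea of a TD target concrete, the sketch below computes the standard one-step target and the standard k-step bootstrapped target, then averages the k-step targets over several subsequent states. The averaging rule is only an illustrative assumption: the abstract does not state the MSTD formula, so `averaged_multi_state_target` should not be read as the authors' definition. All function and parameter names are hypothetical.

```python
from typing import Sequence


def one_step_td_target(reward: float, next_value: float, gamma: float) -> float:
    """Standard one-step TD target: r + gamma * V(s')."""
    return reward + gamma * next_value


def k_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Standard k-step target: discounted rewards plus gamma^k * V(s_k)."""
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def averaged_multi_state_target(
    rewards: Sequence[float], values: Sequence[float], gamma: float
) -> float:
    """ASSUMPTION (not taken from the paper): average the k-step targets.

    values[k - 1] is the value estimate of the state reached after k steps,
    so each of the len(rewards) subsequent states contributes one target.
    """
    if not rewards or len(rewards) != len(values):
        raise ValueError("rewards and values must be non-empty and equally long")
    targets = [
        k_step_target(rewards[:k], values[k - 1], gamma)
        for k in range(1, len(rewards) + 1)
    ]
    return sum(targets) / len(targets)
```

With a single subsequent state the averaged target reduces to the one-step target, which is the traditional case the abstract describes.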

arXiv information

Authors	Wuhao Wang, Zhiyong Chen, Lepeng Zhang
Published	2024-07-01 03:21:38+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
