Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

Summary

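Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), serves as the basis for recent high-performing practical RL algorithms.
However, although function approximation is used in practice, the theoretical understanding of MDVI is limited to tabular Markov decision processes (MDPs).
We study MDVI with linear function approximation through the sample complexity required to identify an $\varepsilon$-optimal policy with probability $1-\delta$, in the setting of an infinite-horizon linear MDP with a generative model and G-optimal design.
We demonstrate that least-squares regression weighted by the variance of the estimated optimal value function of the next state is crucial to achieving minimax optimality.
Based on this observation, we present Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
Furthermore, we propose Deep Variance Weighting (DVW), a practical VWLS algorithm for value-based deep RL.
Our experiments show that DVW improves the performance of popular value-based deep RL algorithms on a set of MinAtar benchmarks.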

Summary (original)

Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy with probability $1-\delta$ under the settings of an infinite-horizon linear MDP, generative model, and G-optimal design. We demonstrate that least-squares regression weighted by the variance of an estimated optimal value function of the next state is crucial to achieving minimax optimality. Based on this observation, we present Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs. Furthermore, we propose a practical VWLS algorithm for value-based deep RL, Deep Variance Weighting (DVW). Our experiments demonstrate that DVW improves the performance of popular value-based deep RL algorithms on a set of MinAtar benchmarks.
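
The variance-weighted regression idea in the abstract can be illustrated with a small sketch. This is not the paper's VWLS-MDVI or DVW algorithm: it only shows generic ridge-regularized least squares in which each sample is weighted by the inverse of a variance estimate. The function names, the `ridge` and `min_var` parameters, and the variance clipping are assumptions made for this illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations update both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_least_squares(features, targets, variances,
                           ridge=1e-6, min_var=1e-3):
    """Minimize sum_i (phi_i . theta - y_i)^2 / var_i + ridge * |theta|^2.

    `variances` holds a per-sample variance estimate (hypothetical input,
    standing in for the variance of the next-state value estimate).
    """
    d = len(features[0])
    gram = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    moment = [0.0] * d
    for phi, y, var in zip(features, targets, variances):
        # Clip the variance from below so tiny estimates do not dominate.
        w = 1.0 / max(var, min_var)
        for i in range(d):
            moment[i] += w * phi[i] * y
            for j in range(d):
                gram[i][j] += w * phi[i] * phi[j]
    return solve_linear_system(gram, moment)


# Two observations of the same quantity with variances 1 and 9:
# the low-variance observation (0.0) receives nine times the weight.
theta = weighted_least_squares([[1.0], [1.0]], [0.0, 10.0], [1.0, 9.0])
print(theta)
```

In this toy call the estimate lands close to 1.0, whereas an unweighted average of the two targets would give 5.0. Down-weighting high-variance targets in this way is the general mechanism the abstract refers to; how the variance is estimated and used in the actual algorithms is specified in the paper, not here.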

arXiv information

Authors	Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári, Wataru Kumagai, Yutaka Matsuo
Published	2023-05-22 16:13:05+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
