Fast Convergence of Softmax Policy Mirror Ascent

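Summary

Natural policy gradient (NPG) is a widely used policy optimization algorithm that can be viewed as mirror ascent in the space of probabilities.
Vaswani et al. [2021] recently introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits.
The authors refine this algorithm by removing its need for a normalization across actions, and analyze the resulting method, referred to as SPMA.
For tabular MDPs, they prove that SPMA with a constant step-size matches the linear convergence of NPG and converges faster than constant step-size (accelerated) softmax policy gradient.
To handle large state-action spaces, they extend SPMA to use a log-linear policy parameterization.
Unlike NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation.
Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems.
They prove that SPMA achieves linear convergence to a neighbourhood of the optimal value function.
They further extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks.
The results show that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.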
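
To make the "mirror ascent in the space of probabilities" view of NPG concrete, here is a minimal Python sketch for a single state with fixed action values. The bandit-style toy setting is an assumption made for illustration and is not taken from the paper. The sketch uses the standard tabular form of the NPG update, in which each action probability is multiplied by an exponential of its action value and the result is renormalized across actions. The function names, step size, and action values are hypothetical.

```python
import math


def npg_step(policy, q_values, step_size):
    """One tabular NPG step for a single state.

    Standard multiplicative form: pi'(a) is proportional to
    pi(a) * exp(step_size * Q(a)), followed by normalization.
    """
    weights = [p * math.exp(step_size * q) for p, q in zip(policy, q_values)]
    total = sum(weights)  # explicit normalization across actions
    return [w / total for w in weights]


def run_npg(q_values, step_size=1.0, iterations=50):
    """Start from the uniform policy and apply repeated NPG steps.

    Assumption: q_values stay fixed across iterations (bandit-style toy
    problem), so no policy evaluation step is needed here.
    """
    n_actions = len(q_values)
    policy = [1.0 / n_actions] * n_actions
    for _ in range(iterations):
        policy = npg_step(policy, q_values, step_size)
    return policy


if __name__ == "__main__":
    final_policy = run_npg([1.0, 0.5, 0.0])
    print(final_policy)
```

In this toy problem the probability mass concentrates on the action with the largest value as the iterations proceed. The sketch illustrates NPG only; it does not implement SPMA.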

Abstract (original)

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
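
As a rough illustration of the terms "log-linear policy parameterization" and "convex softmax classification" used above, the following Python sketch fits a log-linear policy for a single state to a given target distribution over actions by minimizing a cross-entropy loss with gradient descent. It is a sketch under stated assumptions, not the paper's algorithm: the target distribution, the feature vectors, the learning rate, and all function names are hypothetical, and how SPMA constructs its targets is not specified in the abstract.

```python
import math


def log_linear_policy(theta, features):
    """pi_theta(a) proportional to exp(theta . phi(a)) for one state.

    features: one feature vector phi(a) per action (hypothetical inputs).
    """
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(theta, features, target):
    """Soft-label cross-entropy; convex in theta for a log-linear policy."""
    probs = log_linear_policy(theta, features)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def fit_to_target(features, target, lr=0.5, steps=500):
    """Plain gradient descent on the cross-entropy loss.

    The gradient with respect to theta is sum_a (pi(a) - target(a)) * phi(a).
    """
    dim = len(features[0])
    theta = [0.0] * dim
    for _ in range(steps):
        probs = log_linear_policy(theta, features)
        grad = [
            sum((p - t) * phi[j] for p, t, phi in zip(probs, target, features))
            for j in range(dim)
        ]
        theta = [th - lr * g for th, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    # One-hot features make this equivalent to a tabular softmax policy.
    features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    target = [0.7, 0.2, 0.1]
    theta = fit_to_target(features, target)
    print(log_linear_policy(theta, features))
```

Because the loss is convex in `theta`, plain gradient descent suffices in this toy case; with richer features the same loss could be handed to any standard convex solver.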

arXiv information

Authors	Reza Asad, Reza Babanezhad, Issam Laradji, Nicolas Le Roux, Sharan Vaswani
Published	2024-11-18 20:27:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
