Fast Convergence of Softmax Policy Mirror Ascent

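Summary

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities.
Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits.
We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as SPMA).
For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and converges faster than constant step-size (accelerated) softmax policy gradient.
To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization.
Unlike NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation.
Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems.
We prove that SPMA achieves linear convergence to a neighbourhood of the optimal value function.
We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks.
Our results show that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.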

Summary (original)

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
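
The abstract's description of NPG as mirror ascent in the space of probabilities can be illustrated for a single state of a tabular MDP. With a KL-divergence mirror map, the step takes the familiar multiplicative-weights form, where each action probability is scaled by exp(eta * Q(s, a)) and the result is normalized across actions. The following is a minimal sketch under stated assumptions: the action values and step size are made up and held fixed, and the function name `npg_step` is hypothetical. It shows the standard tabular NPG step that SPMA is compared against, not SPMA itself.

```python
import math


def npg_step(pi, q, eta):
    """One tabular NPG / KL mirror-ascent step for a single state.

    pi  : current action probabilities (sum to 1)
    q   : action values Q(s, a) for this state (assumed given)
    eta : step size
    """
    # Multiplicative-weights form: scale each probability by exp(eta * Q).
    # Subtracting max(q) only improves numerical stability; the constant
    # factor cancels in the normalization below.
    q_max = max(q)
    weights = [p * math.exp(eta * (qa - q_max)) for p, qa in zip(pi, q)]
    # Explicit normalization across actions.
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: three actions, uniform initial policy.
pi = [1 / 3, 1 / 3, 1 / 3]
q = [1.0, 0.5, 0.0]  # made-up action values, held fixed for illustration
for _ in range(50):
    pi = npg_step(pi, q, eta=0.5)
```

With fixed action values, repeated steps concentrate probability on the highest-valued action. In a full algorithm the action values would be re-estimated for the current policy at every iteration.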

arXiv information

Authors	Reza Asad, Reza Babanezhad, Issam Laradji, Nicolas Le Roux, Sharan Vaswani
Published	2025-05-30 01:46:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
