Improving Value Estimation Critically Enhances Vanilla Policy Gradient

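Summary

Modern policy gradient algorithms such as TRPO and PPO outperform vanilla policy gradient on many RL tasks.
Questioning the common belief that enforcing approximate trust regions is what leads to steady policy improvement in practice, we show that the more important factor is the improved value estimation accuracy gained from running more value update steps in each iteration.
To demonstrate this, we show that simply increasing the number of value update steps per iteration lets vanilla policy gradient itself match or exceed the performance of PPO on all standard continuous control benchmark environments.
Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, which opens up the possibility that RL algorithms can become more effective and easier to use.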

Summary (original)

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
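The abstract's central knob, the number of value update steps per iteration, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear value function, synthetic features and returns, and plain gradient descent on the mean squared error, and the names `fit_value`, `value_mse`, and `num_value_steps` are hypothetical. It only shows that, on a fixed batch, running more value update steps gives a more accurate value fit.

```python
import numpy as np


def fit_value(features, returns, weights, num_value_steps, lr=0.1):
    """Run `num_value_steps` gradient steps on the mean squared error
    between a linear value estimate and the observed returns."""
    w = weights.copy()
    n = len(returns)
    for _ in range(num_value_steps):
        residual = features @ w - returns       # V(s) - G for each sample
        grad = 2.0 * features.T @ residual / n  # gradient of the MSE w.r.t. w
        w -= lr * grad
    return w


def value_mse(features, returns, weights):
    """Mean squared error of the linear value estimate on the batch."""
    return float(np.mean((features @ weights - returns) ** 2))


# Synthetic batch (assumption): 256 states with 4 features each, and
# returns generated from a known linear function plus small noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
returns = features @ true_w + 0.1 * rng.normal(size=256)

w0 = np.zeros(4)
w_few = fit_value(features, returns, w0, num_value_steps=1)
w_many = fit_value(features, returns, w0, num_value_steps=50)

mse_few = value_mse(features, returns, w_few)
mse_many = value_mse(features, returns, w_many)
print(f"value MSE after 1 step:   {mse_few:.4f}")
print(f"value MSE after 50 steps: {mse_many:.4f}")
```

In a full policy gradient loop, a value fit like this would run once per iteration on the freshly collected batch, before advantages are computed for the policy update. The sketch leaves out the policy update entirely.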

arXiv information

Authors	Tao Wang, Ruipeng Zhang, Sicun Gao
Published	2025-05-25 17:54:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
