On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

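Summary

On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy.
However, after observing only a finite number of trajectories, on-policy sampling can produce data that does not match the expected on-policy data distribution.
This sampling error leads to noisy updates and data-inefficient on-policy learning.
Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling.
Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy.
Rather than discarding data from old policies, as on-policy algorithms commonly do, PROPS uses data collection to adjust the distribution of previously collected data so that it is approximately on-policy.
We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks, and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms.
Our work improves the RL community's understanding of a nuance in the on-policy vs. off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.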

Summary (original)

On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling can produce. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies — as is commonly done in on-policy algorithms — PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that (1) PROPS decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community’s understanding of a nuance in the on-policy vs off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
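The notion of sampling error in the abstract can be illustrated with a small sketch on a single state with three actions. It is not the PROPS algorithm: the abstract describes an adaptive behavior policy, whereas the code below swaps in a deterministic greedy rule that always takes the currently most under-sampled action. The function names, the example policy, and the use of total variation distance as the error measure are all assumptions made for illustration.

```python
import random


def sampling_error(counts, policy):
    """Total variation distance between empirical action frequencies and the policy.

    Assumption: total variation is used here only as a simple error measure.
    """
    total = sum(counts)
    return 0.5 * sum(abs(c / total - p) for c, p in zip(counts, policy))


def on_policy_sample(policy, n, rng):
    """Draw n i.i.d. actions from the policy and return per-action counts."""
    counts = [0] * len(policy)
    for _ in range(n):
        action = rng.choices(range(len(policy)), weights=policy)[0]
        counts[action] += 1
    return counts


def adaptive_sample(policy, n):
    """Hypothetical stand-in for adaptive sampling (not PROPS itself).

    At each step, take the action whose observed count lags furthest
    behind its expected count under the policy.
    """
    counts = [0] * len(policy)
    for t in range(n):
        deficits = [(t + 1) * p - c for p, c in zip(policy, counts)]
        action = max(range(len(policy)), key=lambda i: deficits[i])
        counts[action] += 1
    return counts


if __name__ == "__main__":
    policy = [0.5, 0.3, 0.2]  # example action probabilities (assumed)
    rng = random.Random(0)
    iid_counts = on_policy_sample(policy, 10, rng)
    adaptive_counts = adaptive_sample(policy, 10)
    print("i.i.d. counts:", iid_counts, "error:", sampling_error(iid_counts, policy))
    print("adaptive counts:", adaptive_counts, "error:", sampling_error(adaptive_counts, policy))
```

With only ten samples, the i.i.d. counts will usually deviate from the policy's probabilities, while the greedy rule keeps the empirical frequencies close to them. This mirrors the abstract's point that what on-policy learning needs is data that matches the on-policy distribution, regardless of how it was sampled.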

arXiv information

Authors	Nicholas E. Corrado, Josiah P. Hanna
Published	2023-11-14 16:37:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
