Correcting discount-factor mismatch in on-policy policy gradient methods

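Summary

The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the \emph{discounted stationary distribution}.
However, commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution. This is technically incorrect and may even cause degenerate learning behavior in some environments.
An existing solution corrects this discrepancy by using $\gamma^t$ as a factor in the gradient estimate.
However, that solution is not widely adopted and does not work well in tasks where later states are similar to earlier states.
The authors introduce a novel distribution correction that accounts for the discounted stationary distribution and can be plugged into many existing gradient estimators.
Their correction avoids the performance degradation associated with the $\gamma^t$ correction while achieving lower variance.
Importantly, compared to the uncorrected estimators, their algorithm provides improved state emphasis that helps evade suboptimal policies in certain environments, and it consistently matches or exceeds the original performance on several OpenAI Gym and DeepMind suite benchmarks.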
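
To make the mismatch concrete, here is a minimal sketch for a REINFORCE-style estimator over a single episode. It contrasts the commonly used per-step weight (the discounted return $G_t$) with the existing $\gamma^t$ correction (weight $\gamma^t G_t$). It does not implement the new distribution correction, which the abstract does not specify. The function names `discounted_returns` and `per_step_weights` and the reward values are hypothetical, chosen only for illustration.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def per_step_weights(rewards, gamma, use_gamma_t):
    """Scalar weight multiplying grad log pi(a_t | s_t) at each step (illustrative only).

    use_gamma_t=False: commonly used estimator, weight = G_t
                       (discounting in the state distribution is ignored).
    use_gamma_t=True:  existing gamma^t correction, weight = gamma^t * G_t.
    """
    returns = discounted_returns(rewards, gamma)
    if not use_gamma_t:
        return returns
    discounts = gamma ** np.arange(len(rewards))
    return discounts * returns


# Hypothetical three-step episode with reward 1 at every step.
rewards = [1.0, 1.0, 1.0]
print(per_step_weights(rewards, gamma=0.5, use_gamma_t=False))
print(per_step_weights(rewards, gamma=0.5, use_gamma_t=True))
```

With the $\gamma^t$ factor, the weight on later steps shrinks geometrically with $t$, so samples from late in an episode contribute little to the gradient estimate.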

Summary (original)

The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the \emph{discounted stationary distribution}. But commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution, which is technically incorrect and may even cause degenerate learning behavior in some environments. An existing solution corrects this discrepancy by using $\gamma^t$ as a factor in the gradient estimate. However, this solution is not widely adopted and does not work well in tasks where the later states are similar to earlier states. We introduce a novel distribution correction to account for the discounted stationary distribution that can be plugged into many existing gradient estimators. Our correction circumvents the performance degradation associated with the $\gamma^t$ correction with a lower variance. Importantly, compared to the uncorrected estimators, our algorithm provides improved state emphasis to evade suboptimal policies in certain environments and consistently matches or exceeds the original performance on several OpenAI gym and DeepMind suite benchmarks.

arXiv information

Authors	Fengdi Che, Gautham Vasan, A. Rupam Mahmood
Published	2023-06-23 04:10:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
