Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Abstract

Compared to on-policy counterparts, off-policy model-free deep reinforcement learning can improve data efficiency by repeatedly using previously gathered data. However, off-policy learning becomes challenging when the discrepancy between the underlying distributions of the agent’s policy and the collected data increases. Although the well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, they usually require a collection of long trajectories and induce additional problems such as vanishing/exploding gradients or discarding many useful experiences, which eventually increases the computational complexity. Moreover, their generalization to either continuous action domains or policies approximated by deterministic deep neural networks is strictly limited. To overcome these limitations, we introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control. Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks. Theoretical and empirical studies demonstrate that it can achieve ‘safe’ off-policy learning and substantially improve the state-of-the-art by attaining higher returns in fewer steps than the competing methods through an effective schedule of the learning rate in Q-learning and policy optimization.
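The abstract's point about trajectory-level importance sampling can be illustrated with a small sketch. This is not the paper's correction method. It only shows the standard behavior that a trajectory weight is a product of per-step probability ratios, so its spread grows with the horizon. The two Gaussian policies, their means, and the function names are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_logpdf(a, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at a."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)


def trajectory_log_weights(horizon, n_traj=2000, seed=0):
    """Log importance weights of whole trajectories.

    Assumption (illustrative only): the behavior policy samples actions
    from N(0, 1) and the target policy is N(0.3, 1), independent of state.
    """
    rng = np.random.default_rng(seed)
    actions = rng.normal(0.0, 1.0, size=(n_traj, horizon))
    step_log_ratio = (
        gaussian_logpdf(actions, 0.3, 1.0) - gaussian_logpdf(actions, 0.0, 1.0)
    )
    # The trajectory weight is the product of per-step ratios,
    # which is the sum of per-step log-ratios.
    return step_log_ratio.sum(axis=1)


if __name__ == "__main__":
    for horizon in (1, 10, 100):
        log_w = trajectory_log_weights(horizon)
        print(f"horizon={horizon:4d}  std of log-weight={log_w.std():.3f}")
```

Under these assumptions the standard deviation of the log-weight grows roughly with the square root of the horizon, so the weights themselves spread over several orders of magnitude for long trajectories. A single-step correction, as the abstract proposes, avoids forming this product.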

arXiv information

Authors	Baturay Saglam, Dogan C. Cicek, Furkan B. Mutlu, Suleyman S. Kozat
Published	2023-06-05 13:32:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
