Rethinking Population-assisted Off-policy Reinforcement Learning

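Summary

Title: Rethinking Population-assisted Off-policy Reinforcement Learning

Summary:
– Off-policy reinforcement learning (RL) algorithms are sample efficient thanks to gradient-based updates and data reuse in the replay buffer, but limited exploration makes them prone to converging to local optima.
– Population-based algorithms, by contrast, offer a natural exploration strategy, but their heuristic black-box operators are inefficient.
– Recent algorithms integrate the two methods through a shared replay buffer, but how the diverse data from population optimization iterations affects the off-policy RL algorithm has not been thoroughly investigated.
– The paper first analyzes the use of off-policy RL algorithms in combination with population-based algorithms and shows that using population data could introduce an overlooked error and harm performance.
– To test this, the authors propose a uniform and scalable training design and run experiments on robot locomotion tasks from the OpenAI gym.
– The results substantiate that using population data in off-policy RL can cause instability during training and can even degrade performance.
– To remedy this issue, the authors propose a double replay buffer design that provides more on-policy data, and they show its effectiveness through experiments.
– The results offer practical insights for training these hybrid methods.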

Summary (original)

While off-policy reinforcement learning (RL) algorithms are sample efficient due to gradient-based updates and data reuse in the replay buffer, they struggle with convergence to local optima due to limited exploration. On the other hand, population-based algorithms offer a natural exploration strategy, but their heuristic black-box operators are inefficient. Recent algorithms have integrated these two methods, connecting them through a shared replay buffer. However, the effect of using diverse data from population optimization iterations on off-policy RL algorithms has not been thoroughly investigated. In this paper, we first analyze the use of off-policy RL algorithms in combination with population-based algorithms, showing that the use of population data could introduce an overlooked error and harm performance. To test this, we propose a uniform and scalable training design and conduct experiments on our tailored framework in robot locomotion tasks from the OpenAI gym. Our results substantiate that using population data in off-policy RL can cause instability during training and even degrade performance. To remedy this issue, we further propose a double replay buffer design that provides more on-policy data and show its effectiveness through experiments. Our results offer practical insights for training these hybrid methods.
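The abstract names a double replay buffer that supplies more on-policy data but does not spell out its mechanics. The following is only a minimal sketch of one way such a design could look, not the authors' implementation. It assumes a large shared buffer that stores transitions from both the population and the RL agent, plus a small buffer that keeps only the RL agent's most recent transitions, with each batch mixing the two. The class name `DoubleReplayBuffer`, the `local_fraction` parameter, and the capacities are all hypothetical.

```python
import random
from collections import deque


class DoubleReplayBuffer:
    """Hypothetical sketch: a shared buffer plus a small buffer of recent RL-agent data."""

    def __init__(self, shared_capacity=100_000, local_capacity=5_000,
                 local_fraction=0.5, seed=None):
        if not 0.0 <= local_fraction <= 1.0:
            raise ValueError("local_fraction must be in [0, 1]")
        # Shared buffer: transitions from the population AND the RL agent.
        self.shared = deque(maxlen=shared_capacity)
        # Local buffer: only the RL agent's recent transitions (closer to on-policy).
        self.local = deque(maxlen=local_capacity)
        # Assumed mixing ratio; the source does not state one.
        self.local_fraction = local_fraction
        self.rng = random.Random(seed)

    def add(self, transition, from_rl_agent):
        """Store a transition; RL-agent transitions also go to the local buffer."""
        self.shared.append(transition)
        if from_rl_agent:
            self.local.append(transition)

    def sample(self, batch_size):
        """Draw a mixed batch; returns fewer items if the buffers are too small."""
        n_local = min(int(round(batch_size * self.local_fraction)), len(self.local))
        n_shared = min(batch_size - n_local, len(self.shared))
        # Each draw is without replacement, but the same RL transition may
        # appear in both draws because it lives in both buffers.
        batch = self.rng.sample(list(self.local), n_local)
        batch += self.rng.sample(list(self.shared), n_shared)
        self.rng.shuffle(batch)
        return batch


# Usage with placeholder transitions (real ones would be (s, a, r, s', done) tuples).
buffer = DoubleReplayBuffer(shared_capacity=1_000, local_capacity=50, seed=0)
for step in range(200):
    buffer.add(("population", step), from_rl_agent=False)
for step in range(100):
    buffer.add(("rl", step), from_rl_agent=True)
batch = buffer.sample(32)
```

Raising `local_fraction` pushes each batch toward the RL agent's own recent experience, at the cost of using less of the diverse population data; where the right balance lies is an empirical question that the paper's experiments address.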

arXiv information

Authors	Bowen Zheng, Ran Cheng
Published	2023-05-04 15:53:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
