Learning and reusing primitive behaviours to improve Hindsight Experience Replay sample efficiency

Abstract

Hindsight Experience Replay (HER) is a technique used in reinforcement learning (RL) that has proven to be very efficient for training off-policy RL-based agents to solve goal-based robotic manipulation tasks using sparse rewards. Even though HER improves the sample efficiency of RL-based agents by learning from mistakes made in past experiences, it does not provide any guidance while exploring the environment. This leads to very long training times due to the volume of experience required to train an agent using this replay strategy. In this paper, we propose a method that uses primitive behaviours, previously learned to solve simple tasks, to guide the agent toward more rewarding actions during exploration while it learns other, more complex tasks. This guidance, however, is not executed by a manually designed curriculum; instead, a critic network decides at each timestep whether or not to use the actions proposed by the previously learned primitive policies. We evaluate our method by comparing its performance against HER and other more efficient variations of this algorithm in several block manipulation tasks. We demonstrate that agents can learn a successful policy faster when using our proposed method, both in terms of sample efficiency and computation time. Code is available at https://github.com/franroldans/qmp-her.
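
The per-timestep decision described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the critic scores the agent's own action alongside each primitive's proposal and that the highest-scoring candidate is executed, a selection rule the abstract does not spell out. The names select_action, untrained_agent, move_right, move_left and toy_critic are hypothetical, and the 1-D toy task merely stands in for a real manipulation environment.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Goal = Sequence[float]
Action = List[float]
Policy = Callable[[State, Goal], Action]
Critic = Callable[[State, Goal, Action], float]


def select_action(
    state: State,
    goal: Goal,
    agent_policy: Policy,
    primitive_policies: Sequence[Policy],
    critic: Critic,
) -> Action:
    """Pick the candidate action with the highest critic estimate.

    Assumption: candidates are compared by their critic value. On ties,
    max() keeps the first candidate, which is the agent's own action.
    """
    candidates = [agent_policy(state, goal)]
    candidates.extend(policy(state, goal) for policy in primitive_policies)
    return max(candidates, key=lambda action: critic(state, goal, action))


# Hypothetical 1-D stand-ins: the state is a position, the goal a target position.
def untrained_agent(state: State, goal: Goal) -> Action:
    return [0.0]  # an agent that has not yet learned to move


def move_right(state: State, goal: Goal) -> Action:
    return [1.0]  # primitive behaviour: step right


def move_left(state: State, goal: Goal) -> Action:
    return [-1.0]  # primitive behaviour: step left


def toy_critic(state: State, goal: Goal, action: Action) -> float:
    # Toy value estimate: negative distance to the goal after taking the action.
    return -abs(state[0] + action[0] - goal[0])


chosen = select_action([0.0], [1.0], untrained_agent, [move_right, move_left], toy_critic)
print(chosen)  # [1.0]
```

In this toy setting the primitive's proposal is used only while it scores higher than the agent's own action; once the agent is at the goal, its own zero action is kept.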

arXiv information

Authors	Francisco Roldan Sanchez, Qiang Wang, David Cordova Bulens, Kevin McGuinness, Stephen Redmond, Noel O’Connor
Published	2023-10-03 06:49:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
