Learning and reusing primitive behaviours to improve Hindsight Experience Replay sample efficiency

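Summary

Hindsight Experience Replay (HER) is a reinforcement learning (RL) technique that has proven very efficient for training off-policy RL agents to solve goal-based robotic manipulation tasks with sparse rewards.
HER improves the sample efficiency of RL agents by learning from mistakes made in past experience, but it provides no guidance while the agent explores the environment.
Training an agent with this replay strategy therefore requires a large volume of experience, which makes training times very long.
This paper proposes a method that uses primitive behaviours, previously learned to solve simple tasks, to guide the agent toward more rewarding actions during exploration while it learns other, more complex tasks.
This guidance is not carried out by a manually designed curriculum; instead, a critic network decides at each timestep whether to use the actions proposed by the previously learned primitive policies.
The method is evaluated against HER and other more efficient variants of that algorithm on several block manipulation tasks.
The results show that agents learn a successful policy faster with the proposed method, in terms of both sample efficiency and computation time.
Code is available at https://github.com/franroldans/qmp-her.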

Abstract (original)

Hindsight Experience Replay (HER) is a technique used in reinforcement learning (RL) that has proven to be very efficient for training off-policy RL-based agents to solve goal-based robotic manipulation tasks using sparse rewards. Even though HER improves the sample efficiency of RL-based agents by learning from mistakes made in past experiences, it does not provide any guidance while exploring the environment. This leads to very large training times due to the volume of experience required to train an agent using this replay strategy. In this paper, we propose a method that uses primitive behaviours that have been previously learned to solve simple tasks in order to guide the agent toward more rewarding actions during exploration while learning other more complex tasks. This guidance, however, is not executed by a manually designed curriculum, but rather using a critic network to decide at each timestep whether or not to use the actions proposed by the previously-learned primitive policies. We evaluate our method by comparing its performance against HER and other more efficient variations of this algorithm in several block manipulation tasks. We demonstrate the agents can learn a successful policy faster when using our proposed method, both in terms of sample efficiency and computation time. Code is available at https://github.com/franroldans/qmp-her.
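The abstract says only that a critic network decides, at each timestep, whether to use the actions proposed by previously learned primitive policies. One plausible reading, sketched below, is to score the agent's own action and every primitive's action with the critic and execute the highest-scoring one. The argmax rule, the function name `select_action`, and the toy 1-D policies and critic are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = List[float]
Policy = Callable[[State, State], Action]          # (observation, goal) -> action
Critic = Callable[[State, State, Action], float]   # (observation, goal, action) -> Q estimate


def select_action(obs: State, goal: State, agent_policy: Policy,
                  primitives: Sequence[Policy], critic: Critic) -> Tuple[Action, int]:
    """Pick the candidate action the critic rates highest (assumed selection rule).

    Index 0 is the agent's own action; index i >= 1 is primitive i - 1.
    Ties go to the lowest index, so the agent's own action wins a tie.
    """
    candidates = [agent_policy(obs, goal)] + [p(obs, goal) for p in primitives]
    scores = [critic(obs, goal, a) for a in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], best


# Hypothetical 1-D toy problem: the observation is a position, the action a displacement.
def untrained_agent(obs: State, goal: State) -> Action:
    return [0.0]   # stands in for a policy that has not learned anything yet


def move_right(obs: State, goal: State) -> Action:
    return [1.0]   # stands in for a previously learned primitive


def move_left(obs: State, goal: State) -> Action:
    return [-1.0]  # stands in for another previously learned primitive


def toy_critic(obs: State, goal: State, action: Action) -> float:
    # Stand-in for a learned Q-function: negative distance to the goal after acting.
    return -abs(obs[0] + action[0] - goal[0])


if __name__ == "__main__":
    action, index = select_action([0.0], [3.0], untrained_agent,
                                  [move_right, move_left], toy_critic)
    print(action, index)
```

In this toy setup the goal lies to the right of the agent, so the critic prefers the `move_right` primitive over the untrained agent's null action. In an off-policy setting the executed transition would then be stored in the replay buffer as usual, whichever policy proposed it.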

arXiv information

Authors	Francisco Roldan Sanchez, Qiang Wang, David Cordova Bulens, Kevin McGuinness, Stephen Redmond, Noel O’Connor
Published	2023-11-19 15:55:56+00:00

Source and services used

arxiv.jp, Google
