Guarded Policy Optimization with Imperfect Online Demonstrations

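Abstract

The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When the teacher policy is assumed to be optimal, it has the perfect timing and capability to intervene in the student's learning process, providing safety guarantees and exploration guidance. In many real-world settings, however, obtaining a well-performing teacher policy is expensive or even impossible. This work relaxes the assumption of a well-performing teacher and develops a new method that can incorporate arbitrary teacher policies with modest or inferior performance. The authors instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that TS2C attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that the method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in accumulated reward in held-out test environments. Code is available at https://metadriverse.github.io/TS2C.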

Abstract (original)

The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. Assuming optimal, the teacher policy has the perfect timing and capability to intervene in the learning process of the student agent, providing safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an Off-Policy Reinforcement Learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and substantial safety guarantee without being affected by the teacher’s own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in terms of higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.
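
The abstract states only that TS2C bases teacher intervention on value estimation; it does not spell out the rule. As a rough illustration of what value-based shared control can look like, the following minimal sketch lets the teacher take over when the teacher's value estimate for the current state exceeds the estimated value of the student's proposed actions by more than a threshold. This is not the authors' implementation: the function `shared_control_action`, the `threshold` and `n_samples` parameters, the gap-versus-threshold rule itself, and the toy 1-D task are all assumptions made for illustration, and the value estimates are passed in as plain callables rather than learned from trajectories.

```python
from statistics import fmean
from typing import Callable, Tuple

# All names here are hypothetical. States and actions are plain floats
# so that the sketch stays self-contained.
Policy = Callable[[float], float]      # state -> action
ValueFn = Callable[[float], float]     # state -> estimated return under the teacher
QFn = Callable[[float, float], float]  # (state, action) -> estimated return under the teacher


def shared_control_action(
    state: float,
    student_policy: Policy,
    teacher_policy: Policy,
    teacher_value: ValueFn,
    teacher_q: QFn,
    threshold: float = 0.1,
    n_samples: int = 8,
) -> Tuple[float, bool]:
    """Return (action to execute, whether the teacher intervened)."""
    # Sample the student's proposals and score them with the teacher's Q estimate.
    student_actions = [student_policy(state) for _ in range(n_samples)]
    expected_student_q = fmean([teacher_q(state, a) for a in student_actions])

    # Assumed rule: intervene only if the student's actions look clearly worse
    # than what the teacher expects to achieve from this state.
    value_gap = teacher_value(state) - expected_student_q
    if value_gap > threshold:
        return teacher_policy(state), True
    return student_actions[0], False


# Toy 1-D task (an assumption): the agent should move the state to 0.
def toy_teacher(state: float) -> float:
    return -state  # jumps straight to 0


def toy_student(state: float) -> float:
    return 0.0  # an untrained student that never moves


def toy_teacher_q(state: float, action: float) -> float:
    return -(state + action) ** 2  # penalty for the distance left after acting


def toy_teacher_value(state: float) -> float:
    return toy_teacher_q(state, toy_teacher(state))


far = shared_control_action(1.0, toy_student, toy_teacher, toy_teacher_value, toy_teacher_q)
near = shared_control_action(0.1, toy_student, toy_teacher, toy_teacher_value, toy_teacher_q)
print(far, near)
```

In this toy setup the teacher takes over at state 1.0, where the value gap is 1.0, and leaves the student in control at state 0.1, where the gap is about 0.01 and falls below the threshold. A rule of this shape depends on the teacher's value estimates rather than on how closely the student imitates the teacher, which is one way an imperfect teacher could still be useful.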

arXiv information

Authors	Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou
Published	2023-03-03 06:24:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
