Bandit-Based Policy Invariant Explicit Shaping for Incorporating External Advice in Reinforcement Learning

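Summary

A key challenge for a reinforcement learning (RL) agent is incorporating external or expert advice into its learning.
An algorithm that shapes an RL agent's learning with external advice should ideally:
(a) maintain policy invariance;
(b) accelerate the agent's learning; and
(c) learn from arbitrary advice [3].
To address this challenge, the paper formulates the problem of incorporating external advice into RL as a multi-armed bandit, called the shaping bandit.
The reward of each arm of the shaping bandit corresponds to the return obtained either by following the expert or by following a default RL algorithm that learns from the true environment reward.
The authors show that directly applying existing bandit and shaping algorithms, which do not reason about the non-stationary nature of the underlying returns, can lead to poor results.
They therefore propose three shaping algorithms, UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), each built on different assumptions, that reason about the long-term consequences of following the expert policy or the default RL algorithm.
Experiments in four different settings show that the proposed algorithms achieve the goals above, whereas the other algorithms do not.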

Summary (original)

A key challenge for a reinforcement learning (RL) agent is to incorporate external/expert advice in its learning. The desired goals of an algorithm that can shape the learning of an RL agent with external advice include (a) maintaining policy invariance; (b) accelerating the learning of the agent; and (c) learning from arbitrary advice [3]. To address this challenge, this paper formulates the problem of incorporating external advice in RL as a multi-armed bandit called shaping-bandits. The reward of each arm of shaping bandits corresponds to the return obtained by following the expert or by following a default RL algorithm learning on the true environment reward. We show that directly applying existing bandit and shaping algorithms that do not reason about the non-stationary nature of the underlying returns can lead to poor results. Thus we propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), three different shaping algorithms built on different assumptions that reason about the long-term consequences of following the expert policy or the default RL algorithm. Our experiments in four different settings show that these proposed algorithms achieve the above-mentioned goals whereas the other algorithms fail to do so.
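
To make the bandit formulation concrete, here is a minimal sketch of a two-armed bandit whose arms are "follow the expert" and "follow the default RL algorithm", with each pull returning one episode's return. It uses plain UCB1, which is a stand-in and not UPIES, RPIES, or LPIES; the abstract does not give those algorithms' details. The function names, the exploration constant, and the toy return model (a fixed mediocre expert and a learner whose return grows with the number of episodes it has been trained on) are all assumptions made for illustration.

```python
import math
import random

EXPERT, DEFAULT_RL = 0, 1  # hypothetical arm labels


def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the highest UCB1 index; try each arm once first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


def toy_return(arm, pulls, rng):
    """Assumed toy return model (not from the paper).

    The expert's return is stationary. The default learner's return
    depends on how many episodes it has already been trained on, so
    that arm's reward distribution is non-stationary.
    """
    noise = rng.gauss(0.0, 0.05)
    if arm == EXPERT:
        return 0.5 + noise
    return min(1.0, 0.1 + 0.02 * pulls) + noise


def run_shaping_bandit(episode_return, n_arms=2, n_episodes=200, seed=0):
    """Run UCB1 over the arms, treating each episode's return as the reward."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_episodes + 1):
        arm = ucb1_select(counts, means, t)
        g = episode_return(arm, counts[arm], rng)
        counts[arm] += 1
        means[arm] += (g - means[arm]) / counts[arm]  # incremental mean
    return counts, means


counts, means = run_shaping_bandit(toy_return)
print("pulls per arm:", counts)
print("mean return per arm:", means)
```

Because the running mean for the learner's arm averages over its early, low returns, it lags behind what that arm currently yields. This is one way a bandit that assumes stationary rewards can misjudge the arms, which is the failure mode the abstract attributes to directly applying existing bandit algorithms.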

arXiv information

Authors	Yash Satsangi,Paniz Behboudian
Published	2023-09-18 11:41:48+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
