The Virtues of Pessimism in Inverse Reinforcement Learning

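Summary

Inverse reinforcement learning (IRL) is a powerful framework for learning complex behaviors from expert demonstrations.
Traditionally, however, it requires repeatedly solving a computationally expensive reinforcement learning (RL) problem in its inner loop.
It is therefore desirable to reduce the exploration burden by leveraging the expert demonstrations inside the inner-loop RL.
For example, recent work resets the learner to expert states in order to inform it of high-reward expert states.
Such an approach, however, is infeasible in the real world.
In this work, we consider an alternative way to speed up the RL subroutine in IRL: pessimism, that is, staying close to the expert's data distribution, instantiated via offline RL algorithms.
We formalize a connection between offline RL and IRL, which allows an arbitrary offline RL algorithm to be used to improve the sample efficiency of IRL.
We validate the theory experimentally by demonstrating a strong correlation between the efficacy of an offline RL algorithm and how well it works as part of an IRL procedure.
By using a strong offline RL algorithm as part of an IRL procedure, we find policies that match expert performance significantly more efficiently than the prior art.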
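
The idea of a pessimistic inner loop can be illustrated with a toy sketch. The code below is not the authors' algorithm: it assumes a hypothetical five-state chain MDP, a tabular reward updated by matching state-action visitation frequencies, and a simple off-support penalty standing in for an offline RL algorithm. All names (`step`, `greedy_policy`, `visitation`, `pessimistic_irl`, `penalty`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy chain MDP: action 0 moves left, action 1 moves right.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 8


def step(s, a):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(N_STATES - 1, s + 1) if a == 1 else max(0, s - 1)


def greedy_policy(reward):
    """Finite-horizon value iteration; returns a time-indexed greedy policy."""
    v = np.zeros(N_STATES)
    policy = np.zeros((HORIZON, N_STATES), dtype=int)
    for t in reversed(range(HORIZON)):
        q = np.array([[reward[s, a] + v[step(s, a)] for a in range(N_ACTIONS)]
                      for s in range(N_STATES)])
        policy[t] = q.argmax(axis=1)
        v = q.max(axis=1)
    return policy


def visitation(policy, start=0):
    """State-action visit frequencies of a single rollout from `start`."""
    counts = np.zeros((N_STATES, N_ACTIONS))
    s = start
    for t in range(HORIZON):
        a = policy[t, s]
        counts[s, a] += 1
        s = step(s, a)
    return counts / HORIZON


def pessimistic_irl(expert_visits, n_iters=50, lr=1.0, penalty=1.0):
    """Outer loop: update the reward. Inner loop: plan with a pessimistic reward."""
    reward = np.zeros((N_STATES, N_ACTIONS))
    # Pessimism stand-in: penalize state-action pairs absent from expert data.
    off_support = (expert_visits == 0).astype(float)
    for _ in range(n_iters):
        policy = greedy_policy(reward - penalty * off_support)
        # Raise reward where the expert visits more often than the learner.
        reward += lr * (expert_visits - visitation(policy))
    return greedy_policy(reward - penalty * off_support)


if __name__ == "__main__":
    expert_policy = np.ones((HORIZON, N_STATES), dtype=int)  # always move right
    expert_visits = visitation(expert_policy)
    learned = pessimistic_irl(expert_visits)
    print(np.allclose(visitation(learned), expert_visits))
```

In this sketch the penalty steers the inner planner toward the expert's state-action support before the reward has been learned, which is the intuition behind using pessimism to cut exploration; a real instantiation would replace the penalized planner with an actual offline RL algorithm.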

Summary (original)

Inverse Reinforcement Learning (IRL) is a powerful framework for learning complex behaviors from expert demonstrations. However, it traditionally requires repeatedly solving a computationally expensive reinforcement learning (RL) problem in its inner loop. It is desirable to reduce the exploration burden by leveraging expert demonstrations in the inner-loop RL. As an example, recent work resets the learner to expert states in order to inform the learner of high-reward expert states. However, such an approach is infeasible in the real world. In this work, we consider an alternative approach to speeding up the RL subroutine in IRL: pessimism, i.e., staying close to the expert’s data distribution, instantiated via the use of offline RL algorithms. We formalize a connection between offline RL and IRL, enabling us to use an arbitrary offline RL algorithm to improve the sample efficiency of IRL. We validate our theory experimentally by demonstrating a strong correlation between the efficacy of an offline RL algorithm and how well it works as part of an IRL procedure. By using a strong offline RL algorithm as part of an IRL procedure, we are able to find policies that match expert performance significantly more efficiently than the prior art.

arXiv information

Authors	David Wu, Gokul Swamy, J. Andrew Bagnell, Zhiwei Steven Wu, Sanjiban Choudhury
Published	2024-02-08 16:50:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
