Offline Reinforcement Learning with Imputed Rewards

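Summary

Offline reinforcement learning (ORL) offers a robust solution for training agents in applications where interaction with the environment must be strictly limited because of cost, safety, or the lack of an accurate simulation environment.
Despite its potential to ease the deployment of artificial agents in the real world, offline reinforcement learning typically requires a very large number of demonstrations annotated with ground-truth rewards.
Consequently, state-of-the-art ORL algorithms can be difficult or impossible to apply in data-scarce scenarios.
This paper proposes a simple but effective reward model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards.
Once the reward signal has been modeled, the reward model is used to impute rewards for a large sample of reward-free transitions, which makes ORL methods applicable.
The approach's potential is demonstrated on several D4RL continuous locomotion tasks.
The results show that, using only 1% of the reward-labeled transitions from the original datasets, the learned reward model can impute rewards for the remaining 99% of transitions, from which performant agents can be learned with offline reinforcement learning.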
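
The two-step recipe in the summary (fit a reward model on the small labeled sample, then impute rewards for the reward-free transitions) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not describe the reward model's architecture, so plain ridge regression stands in for it here. The synthetic transitions, the linear ground-truth reward, and the helper names `fit_reward_model`, `predict_rewards`, and `impute_rewards` are all hypothetical and are not the authors' implementation or D4RL data.

```python
import numpy as np


def fit_reward_model(features, rewards, l2=1e-3):
    """Fit ridge-regression weights (plus a bias term) on labeled transitions."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ rewards)


def predict_rewards(weights, features):
    """Predict a reward for every row of features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights


def impute_rewards(states, actions, next_states, rewards, labeled_mask):
    """Keep known rewards where labeled; fill in model predictions elsewhere."""
    features = np.hstack([states, actions, next_states])
    weights = fit_reward_model(features[labeled_mask], rewards[labeled_mask])
    imputed = predict_rewards(weights, features)
    imputed[labeled_mask] = rewards[labeled_mask]
    return imputed


# Synthetic stand-in for an offline dataset (assumption: not D4RL data).
rng = np.random.default_rng(0)
n, state_dim, action_dim = 10_000, 4, 2
states = rng.normal(size=(n, state_dim))
actions = rng.normal(size=(n, action_dim))
next_states = states + 0.1 * rng.normal(size=(n, state_dim))

# Hypothetical ground-truth reward: linear in the transition, with small noise.
true_rewards = (
    next_states[:, 0]
    - 0.1 * actions.sum(axis=1)
    + 0.01 * rng.normal(size=n)
)

# Only 1% of the transitions keep their reward label; the rest are unknown.
labeled_idx = rng.choice(n, size=n // 100, replace=False)
labeled_mask = np.zeros(n, dtype=bool)
labeled_mask[labeled_idx] = True
observed_rewards = np.where(labeled_mask, true_rewards, np.nan)

imputed = impute_rewards(
    states, actions, next_states, observed_rewards, labeled_mask
)

# Error on the 99% of transitions whose rewards were imputed.
mse = float(np.mean((imputed[~labeled_mask] - true_rewards[~labeled_mask]) ** 2))
print(f"labeled: {labeled_mask.sum()}, imputation MSE: {mse:.6f}")
```

In this sketch, the transitions together with the `imputed` rewards would form the dataset handed to an offline RL algorithm of choice; a real reward signal is unlikely to be linear, so a more expressive regressor would probably be needed in practice.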

Abstract (original)

Offline Reinforcement Learning (ORL) offers a robust solution to training agents in applications where interactions with the environment must be strictly limited due to cost, safety, or lack of accurate simulation environments. Despite its potential to facilitate deployment of artificial agents in the real world, Offline Reinforcement Learning typically requires very many demonstrations annotated with ground-truth rewards. Consequently, state-of-the-art ORL algorithms can be difficult or impossible to apply in data-scarce scenarios. In this paper we propose a simple but effective Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards. Once the reward signal is modeled, we use the Reward Model to impute rewards for a large sample of reward-free transitions, thus enabling the application of ORL techniques. We demonstrate the potential of our approach on several D4RL continuous locomotion tasks. Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions, from which performant agents can be learned using Offline Reinforcement Learning.

arXiv information

Authors	Carlo Romeo, Andrew D. Bagdanov
Published	2024-07-15 15:53:13+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
