MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

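Summary

Title: MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

Summary:
– Studies a new paradigm called offline Policy Learning from Observation (PLfO).
– Offline PLfO aims to learn policies from imperfect datasets in which 1) only a subset of trajectories is labeled with rewards, 2) labeled trajectories may not contain actions, 3) labeled trajectories may not be of high quality, and 4) the overall data may not have full coverage.
– As a result, offline PLfO encompasses many existing offline learning setups, including offline imitation learning (IL), ILfO, and reinforcement learning (RL).
– The paper presents a generic approach called MAHALO. Building on the pessimism concept in offline RL, MAHALO optimizes the policy using a performance lower bound that accounts for uncertainty due to the dataset's insufficient coverage.
– MAHALO is implemented by adversarially training data-consistent critic and reward functions during policy optimization, which makes the learned policy robust to data deficiency.
– In both theory and experiments, MAHALO is shown to consistently outperform or match specialized algorithms across a variety of offline PLfO tasks.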
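
The idea of optimizing a performance lower bound over data-consistent hypotheses can be illustrated with a toy example. The sketch below is not the authors' algorithm: it is a one-step, three-action problem invented for illustration, where the action names, the reward labels, and the candidate reward grid are all assumptions. It enumerates every reward function that agrees with the observed labels and picks the action with the best worst-case reward.

```python
from itertools import product

# Hypothetical toy problem (not from the paper): three actions, of which
# only a0 and a1 carry reward labels. a2 is never covered by the data.
ACTIONS = ["a0", "a1", "a2"]
LABELED = {"a0": 0.4, "a1": 0.7}
REWARD_GRID = [0.0, 0.5, 1.0]  # assumed candidate rewards for unlabeled actions


def data_consistent_rewards(actions, labeled, grid):
    """Enumerate reward functions that agree with every observed label."""
    free = [a for a in actions if a not in labeled]
    for values in product(grid, repeat=len(free)):
        yield {**labeled, **dict(zip(free, values))}


def pessimistic_action(actions, labeled, grid):
    """Pick the action whose worst-case reward over consistent hypotheses is largest."""
    hypotheses = list(data_consistent_rewards(actions, labeled, grid))
    # Lower bound per action: the adversary picks the least favorable hypothesis.
    lower_bound = {a: min(h[a] for h in hypotheses) for a in actions}
    best = max(actions, key=lambda a: lower_bound[a])
    return best, lower_bound


best, bounds = pessimistic_action(ACTIONS, LABELED, REWARD_GRID)
print(best, bounds)
```

In this toy setting the uncovered action `a2` could be worth anything on the grid, so its lower bound is the smallest grid value and the pessimistic choice falls back to the best action the data actually supports. MAHALO itself applies this kind of reasoning to sequential problems with learned critic and reward functions, which this sketch does not attempt to reproduce.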

Summary (original)

We study a new paradigm for sequential decision making, called offline Policy Learning from Observation (PLfO). Offline PLfO aims to learn policies using datasets with substandard qualities: 1) only a subset of trajectories is labeled with rewards, 2) labeled trajectories may not contain actions, 3) labeled trajectories may not be of high quality, and 4) the overall data may not have full coverage. Such imperfection is common in real-world learning scenarios, so offline PLfO encompasses many existing offline learning setups, including offline imitation learning (IL), ILfO, and reinforcement learning (RL). In this work, we present a generic approach, called Modality-agnostic Adversarial Hypothesis Adaptation for Learning from Observations (MAHALO), for offline PLfO. Built upon the pessimism concept in offline RL, MAHALO optimizes the policy using a performance lower bound that accounts for uncertainty due to the dataset’s insufficient coverage. We implement this idea by adversarially training data-consistent critic and reward functions in policy optimization, which forces the learned policy to be robust to the data deficiency. We show that MAHALO consistently outperforms or matches specialized algorithms across a variety of offline PLfO tasks in theory and experiments.

arXiv information

Authors	Anqi Li, Byron Boots, Ching-An Cheng
Published	2023-03-30 05:27:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
