Preference Elicitation for Offline Reinforcement Learning

Summary

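Applying reinforcement learning (RL) to real-world problems is often difficult because the agent cannot interact with the environment and reward functions are hard to design.
Offline RL addresses the first challenge by assuming access to an offline dataset of environment interactions labeled by the reward function.
Preference-based RL, in contrast, does not assume access to the reward function and instead learns it from preferences, but it typically requires online interaction with the environment.
We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setting.
We propose Sim-OPRL, an offline preference-based RL algorithm that leverages a learned environment model to elicit preference feedback on simulated rollouts.
Drawing on insights from both the offline RL and the preference-based RL literature, the algorithm is pessimistic about out-of-distribution data and optimistic when acquiring informative preferences about the optimal policy.
We provide theoretical guarantees on the sample complexity of the approach, which depend on how well the offline data covers the optimal policy.
Finally, we demonstrate the empirical performance of Sim-OPRL in a range of environments.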

Summary (original)

Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.
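
The abstract does not specify the preference model or how uncertainty is measured, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a Bradley-Terry preference model, a hypothetical ensemble of reward models, an ensemble-disagreement penalty as the pessimistic estimate, and disagreement-based query selection as a stand-in for the optimistic acquisition of informative preferences. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def preference_probability(return_a, return_b):
    """Bradley-Terry probability that rollout A is preferred over rollout B (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))


def pessimistic_value(ensemble_returns, penalty=1.0):
    """Lower-confidence estimate: ensemble mean minus a disagreement penalty."""
    return mean(ensemble_returns) - penalty * pstdev(ensemble_returns)


def most_informative_pair(rollout_returns):
    """Pick the rollout pair whose predicted preference the ensemble disputes most.

    rollout_returns[i][k] is the return of simulated rollout i under the
    k-th reward model of a hypothetical ensemble.
    """
    def disagreement(pair):
        i, j = pair
        probs = [preference_probability(a, b)
                 for a, b in zip(rollout_returns[i], rollout_returns[j])]
        return pstdev(probs)

    return max(combinations(range(len(rollout_returns)), 2), key=disagreement)


# Toy data: three simulated rollouts scored by three hypothetical reward models.
rollout_returns = [
    [5.0, 5.1, 4.9],   # ensemble agrees this rollout is good
    [2.0, -1.0, 3.0],  # ensemble disagrees strongly
    [1.0, 1.0, 1.0],   # ensemble agrees exactly
]
query = most_informative_pair(rollout_returns)
print(query)  # (1, 2)
```

In this toy setup the query goes to rollouts 1 and 2, because the reward models disagree about which of the two would be preferred, while the pessimistic value of rollout 1 is pulled well below its ensemble mean.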

arXiv information

Authors	Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
Published	2024-06-26 15:59:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
