Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

Summary

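To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task.
However, reward engineering can be a difficult and time-consuming process.
Human-in-the-loop RL methods instead promise to learn reward functions from human feedback.
Despite recent successes, many human-in-the-loop RL methods still require numerous human interactions to learn a successful reward function.
To improve the feedback efficiency of human-in-the-loop RL methods (i.e., to require less human interaction), this paper introduces Sub-optimal Data Pre-training (SDP), an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms.
SDP starts by pseudo-labeling all low-quality data with the minimum environment reward.
This process yields reward labels for pre-training the reward model without requiring human labels or preferences.
The pre-training phase gives the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards.
Extensive experiments with both simulated and human teachers show that SDP can at least match, and often significantly improve on, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.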

Summary (original)

To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many of the human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase provides the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least meet, but often significantly improve, state of the art human-in-the-loop RL performance across a variety of simulated robotic tasks.
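
The pseudo-labeling and pre-training steps described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the feature matrix `sub_optimal` is random stand-in data, `R_MIN` is a hypothetical minimum environment reward, and the reward model is a plain linear regressor trained by gradient descent, whereas the abstract does not specify the model architecture or training details.

```python
import numpy as np


def pseudo_label(transitions, r_min):
    """Assign the minimum environment reward to every sub-optimal transition."""
    return np.full(len(transitions), r_min, dtype=np.float64)


def pretrain_reward_model(features, targets, lr=0.1, epochs=500):
    """Fit a linear reward model r(x) = w . x + b.

    Uses full-batch gradient descent on half the mean squared error.
    The linear form is an illustrative assumption, not taken from the paper.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        err = features @ w + b - targets
        w -= lr * (features.T @ err) / n
        b -= lr * err.mean()
    return w, b


# Hypothetical stand-in for reward-free, low-quality (state, action) features.
rng = np.random.default_rng(0)
sub_optimal = rng.normal(size=(256, 4))

R_MIN = -1.0  # assumed minimum environment reward
labels = pseudo_label(sub_optimal, R_MIN)

# Pre-train on pseudo-labels only; no human feedback is involved at this stage.
w, b = pretrain_reward_model(sub_optimal, labels)
predicted = sub_optimal @ w + b
```

After this pre-training step, the model predicts rewards close to `R_MIN` for the low-quality transitions, which is the "head start" the abstract refers to. In a full pipeline the same model would then be refined with scalar or preference feedback from a teacher.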

arXiv information

Authors	Calarina Muslimani, Matthew E. Taylor
Published	2025-04-07 23:17:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
