Reinforcement Learning from Diverse Human Preferences

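Summary

Designing reward functions is complex, and this has been a major obstacle to applying deep reinforcement learning (RL) techniques widely.
Even experts can find it hard to describe the behaviors and properties they want from an agent.
Reinforcement learning from human preferences (or preference-based RL) has emerged as a promising alternative, in which the reward function is learned from human preference labels over behavior trajectories.
Existing preference-based RL methods, however, are limited by their need for accurate oracle preference labels.
This paper addresses that limitation by developing a method for crowd-sourcing preference labels and for learning from diverse human preferences.
The key idea is to stabilize reward learning through regularization and correction in a latent space.
To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to stay close to the prior distribution.
In addition, a confidence-based reward model ensembling method is designed to produce more stable and reliable predictions.
The method is tested on a variety of tasks in DMcontrol and Meta-world, where it shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.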

Abstract (original)

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent’s desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
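The three ingredients named in the abstract (a preference loss, a latent-space constraint toward a prior, and confidence-based ensembling) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the Bradley-Terry model is the common choice in preference-based RL, the latent is assumed to be a diagonal Gaussian with a standard normal prior, the constraint is assumed to be a KL penalty weighted by a hypothetical `beta`, and the ensemble is assumed to be a plain confidence-weighted average. All function names are hypothetical.

```python
import math


def _softplus(x):
    # log(1 + exp(x)), written to avoid overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(ret_a, ret_b, label):
    """Bradley-Terry cross-entropy for one pair of trajectory segments.

    ret_a, ret_b: summed predicted rewards of segments A and B.
    label: 1.0 if the annotator preferred A, 0.0 if B (0.5 for a tie).
    """
    diff = ret_a - ret_b
    log_p_a = -_softplus(-diff)  # log sigmoid(diff)
    log_p_b = -_softplus(diff)   # log (1 - sigmoid(diff))
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def regularized_loss(ret_a, ret_b, label, mu, log_var, beta=1.0):
    # Assumed form: preference loss plus a KL penalty that pulls the
    # reward model's latent toward the prior. beta is hypothetical.
    kl = kl_to_standard_normal(mu, log_var)
    return preference_loss(ret_a, ret_b, label) + beta * kl


def confidence_weighted_reward(predictions, confidences):
    """Average ensemble members' reward predictions, weighted by confidence."""
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(p * c for p, c in zip(predictions, confidences)) / total
```

In this sketch a larger `beta` pulls the latent harder toward the prior, which is one plausible reading of the "strong constraint" in the abstract. How the paper actually computes each ensemble member's confidence is not stated in the abstract, so the weights here are simply taken as given.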

arXiv information

Authors	Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu
Published	2024-05-08 15:58:02+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
