Query-Policy Misalignment in Preference-Based Reinforcement Learning

Abstract

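Preference-based reinforcement learning (PbRL) offers a natural way to align an RL agent's behavior with the outcomes humans want, but it is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries that maximally improve the overall quality of the reward model; counterintuitively, we find that this does not necessarily lead to better performance. To explain this, we identify a long-neglected issue in the query selection schemes of existing PbRL work: query-policy misalignment. We show that seemingly informative queries chosen to improve the overall quality of the reward model may not actually align with the RL agent's interests, so they contribute little to policy learning and ultimately result in poor feedback efficiency. We show that this issue can be addressed effectively with near on-policy queries and a specially designed hybrid experience replay. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. Through comprehensive experiments, we show that our method achieves substantial gains in both human feedback efficiency and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.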

Abstract (original)

Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents’ behavior with human desired outcomes, but is often restrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries to maximally improve the overall quality of the reward model, but counter-intuitively, we find that this may not necessarily lead to improved performance. To unravel this mystery, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of reward model actually may not align with RL agents’ interests, thus offering little help on policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay, which together enforce the bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
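The abstract names two components, near on-policy query and hybrid experience replay, without giving implementation details. The following is a minimal sketch of how such components might look, not the authors' implementation. Everything in it is an assumption made for illustration: the names `HybridReplayBuffer` and `select_near_on_policy_queries`, the parameters `recent_size`, `recent_fraction`, `n_recent`, and `segment_len`, the 50/50 mixing default, and the use of uniform random segment sampling within recent trajectories.

```python
import random
from collections import deque


class HybridReplayBuffer:
    """Illustrative buffer that mixes uniform samples with recent samples.

    Hypothetical sketch: the class name and the mixing rule are assumptions,
    not taken from the paper.
    """

    def __init__(self, capacity, recent_size):
        self.data = deque(maxlen=capacity)  # oldest items are evicted first
        self.recent_size = recent_size      # how many newest items count as "recent"

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, recent_fraction=0.5, rng=random):
        """Return a batch: the first part from recent items, the rest uniform.

        Sampling is with replacement; the buffer must not be empty.
        """
        items = list(self.data)
        recent = items[-self.recent_size:]
        n_recent = int(batch_size * recent_fraction)
        n_uniform = batch_size - n_recent
        batch = [rng.choice(recent) for _ in range(n_recent)]
        batch += [rng.choice(items) for _ in range(n_uniform)]
        return batch


def select_near_on_policy_queries(trajectories, n_recent, n_queries,
                                  segment_len, rng=random):
    """Draw pairs of segments only from the `n_recent` newest trajectories.

    Assumes `trajectories` is ordered oldest to newest and that every recent
    trajectory has at least `segment_len` steps.
    """
    recent = trajectories[-n_recent:]

    def random_segment():
        traj = rng.choice(recent)
        start = rng.randrange(len(traj) - segment_len + 1)
        return traj[start:start + segment_len]

    # Each query is a pair of segments to be compared by the human labeler.
    return [(random_segment(), random_segment()) for _ in range(n_queries)]
```

In this sketch, restricting queries to recent trajectories keeps the labeled segments close to what the current policy visits, and mixing recent transitions into each training batch makes policy updates concentrate on the region where the reward model has just received feedback. How the paper balances these two sources is not stated in the abstract.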

arXiv information

Authors	Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang
Published	2024-07-05 14:26:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
