When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

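Summary

Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment.
What happens when the human feedback is based only on partial observations?
We formally define two failure cases: deceptive inflation and overjustification.
Modeling the human as Boltzmann-rational with respect to a belief over trajectories, we prove conditions under which RLHF is guaranteed to produce policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both.
Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function.
We show that the human's feedback sometimes determines the return function uniquely up to an additive constant, but that in other realistic cases an irreducible ambiguity remains.
We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.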

Summary (original)

Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human’s partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes, the human’s feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.
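
As a rough illustration of the Boltzmann-rational feedback model mentioned in the abstract, the following minimal Python sketch computes the probability that a human prefers one observation over another when returns are averaged over the human's belief about which trajectory actually occurred. The function names, the trajectory labels, the belief values, and the parameter `beta` are all hypothetical choices made for this example; they are not taken from the paper, and the toy numbers are not the paper's formal construction. The sketch also checks numerically that shifting every return by the same constant leaves the preference probability unchanged, which is why feedback of this kind can identify the return function at most up to an additive constant.

```python
import math


def expected_return(belief, returns):
    """Expected return under a belief: sum over trajectories of B(traj | obs) * G(traj)."""
    return sum(prob * returns[traj] for traj, prob in belief.items())


def preference_probability(belief_a, belief_b, returns, beta=1.0):
    """Boltzmann-rational probability of preferring observation a over observation b.

    The human compares belief-weighted expected returns and picks a with
    probability sigmoid(beta * (E[G | a] - E[G | b])).
    """
    diff = expected_return(belief_a, returns) - expected_return(belief_b, returns)
    return 1.0 / (1.0 + math.exp(-beta * diff))


# Hypothetical toy returns for three possible trajectories (illustrative only).
returns = {
    "task_done": 1.0,
    "task_failed_hidden": -1.0,
    "task_failed_visible": -1.0,
}

# An observation that "looks fine" is consistent with success or a hidden failure.
belief_looks_fine = {"task_done": 0.8, "task_failed_hidden": 0.2}
# An observation that shows an error reveals the failure with certainty.
belief_shows_error = {"task_failed_visible": 1.0}

p = preference_probability(belief_looks_fine, belief_shows_error, returns)

# Adding the same constant to every return does not change the preference.
shifted_returns = {traj: g + 5.0 for traj, g in returns.items()}
p_shifted = preference_probability(belief_looks_fine, belief_shows_error, shifted_returns)

print(f"P(prefer 'looks fine') = {p:.3f}")
print(f"Same probability after shifting returns: {math.isclose(p, p_shifted)}")
```

In this toy setup the human favors the observation that hides a possible failure, even though a hidden failure and a visible failure have the same true return. This loosely conveys the intuition behind the abstract's concern about partial observability, without claiming to reproduce the paper's definitions.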

arXiv information

Authors	Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
Published	2024-11-05 16:46:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
