Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

Summary

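Because complex objectives are difficult to specify precisely, reinforcement learning policies are often optimized against proxy reward functions that only approximate the true goal.
However, optimizing a proxy reward frequently leads to reward hacking: the optimized reward function stops being a good proxy, and the resulting policy performs poorly with respect to the unspecified true reward.
Principled solutions to reward hacking have been held back by the lack of a good definition of the problem.
To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards over the states and actions visited by a "reference policy", a correlation that breaks down under optimization.
We show that this definition captures reward hacking behavior across several realistic settings, including reinforcement learning from human feedback (RLHF).
Using this formulation, we show theoretically that regularization toward the reference policy can effectively prevent reward hacking.
Current practice in RLHF applies a KL penalty between action distributions for this purpose, but our theory suggests that regularizing the $\chi^2$ divergence between the policies' occupancy measures can be more effective.
We give intuition for the benefits of this type of regularization and demonstrate that it mitigates reward hacking better in practice across four realistic settings, including RLHF.
Our code is available at https://github.com/cassidylaidlaw/orpo.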
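
For readers who want the two regularizers named above in symbols, the following block writes out the textbook definitions of each divergence. It is a sketch under stated assumptions, not the paper's exact formulation: the symbols $\mu_\pi$, $\gamma$, $\lambda$, $\tilde{J}$, and $D$ are notation introduced here, the discounted normalization of the occupancy measure is assumed, and the penalized objective is only a generic form of "regularization toward the reference policy".

```latex
% Assumed notation: \mu_\pi is the normalized discounted state-action
% occupancy measure of policy \pi (normalization is an assumption).
\mu_\pi(s,a) = (1-\gamma)\,\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}
    \gamma^t \,\mathbf{1}\{s_t = s,\ a_t = a\}\right]

% KL divergence between action distributions at a state s
% (the kind of penalty the summary attributes to current RLHF practice).
D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)
  = \sum_a \pi(a \mid s)\,\log\frac{\pi(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}

% Chi-squared divergence between occupancy measures (standard definition).
\chi^2\big(\mu_\pi \,\|\, \mu_{\pi_{\mathrm{ref}}}\big)
  = \sum_{s,a} \frac{\big(\mu_\pi(s,a) - \mu_{\pi_{\mathrm{ref}}}(s,a)\big)^2}
                    {\mu_{\pi_{\mathrm{ref}}}(s,a)}

% Generic penalized objective; \tilde{J} is the proxy return and D is a
% placeholder for either divergence. The exact form is not stated here.
\max_\pi \; \tilde{J}(\pi) - \lambda\, D(\pi, \pi_{\mathrm{ref}})
```

The $\chi^2$ term divides by the reference occupancy, so it grows quickly when the optimized policy concentrates on state-action pairs the reference policy rarely visits.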

Abstract (original)

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a ‘reference policy’ that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking. While the current practice in RLHF applies a KL penalty between action distributions for this purpose, our theory suggests regularizing the $\chi^2$ divergence between the policies’ occupancy measures can be more effective. We intuitively show the benefits of this type of regularization and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF. Our code is available at https://github.com/cassidylaidlaw/orpo.
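
The correlation-based idea in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the authors' code: the four state-action pairs, both reward vectors, and both occupancy vectors are hypothetical numbers chosen for illustration, and returns are computed as occupancy-weighted averages, which assumes normalized occupancy measures. It shows a proxy that correlates well with the true reward under a reference occupancy measure, a policy that gains proxy return while losing true return, and the $\chi^2$ divergence between the two occupancy measures.

```python
import numpy as np


def weighted_correlation(x, y, w):
    """Pearson correlation of x and y under probability weights w."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mean_x, mean_y = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mean_x) * (y - mean_y))
    var_x = np.sum(w * (x - mean_x) ** 2)
    var_y = np.sum(w * (y - mean_y) ** 2)
    return float(cov / np.sqrt(var_x * var_y))


def chi2_divergence(p, q):
    """chi^2(p || q) = sum_i (p_i - q_i)^2 / q_i, assuming q_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) ** 2 / q))


# Hypothetical toy problem with four state-action pairs.
# Proxy and true reward agree everywhere except on the last pair.
proxy_reward = np.array([0.0, 1.0, 2.0, 3.0])
true_reward = np.array([0.0, 1.0, 2.0, -3.0])

# Hypothetical occupancy measures (each sums to 1).
mu_ref = np.array([0.40, 0.30, 0.29, 0.01])   # rarely visits the last pair
mu_hack = np.array([0.01, 0.01, 0.01, 0.97])  # concentrates on the last pair

# Correlation between proxy and true reward as seen by the reference policy.
corr_ref = weighted_correlation(proxy_reward, true_reward, mu_ref)

# Returns as occupancy-weighted average rewards (assumed normalization).
proxy_ref, proxy_hack = mu_ref @ proxy_reward, mu_hack @ proxy_reward
true_ref, true_hack = mu_ref @ true_reward, mu_hack @ true_reward

# Divergence of the optimized occupancy measure from the reference one.
chi2 = chi2_divergence(mu_hack, mu_ref)

print(f"correlation under reference occupancy: {corr_ref:.3f}")
print(f"proxy return: reference {proxy_ref:.2f}, optimized {proxy_hack:.2f}")
print(f"true return:  reference {true_ref:.2f}, optimized {true_hack:.2f}")
print(f"chi^2(mu_hack || mu_ref): {chi2:.2f}")
```

In this toy setup the large divergence comes almost entirely from the one pair that the reference policy rarely visits, which is the kind of shift a penalty on occupancy measures is meant to discourage.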

arXiv information

Authors	Cassidy Laidlaw, Shivam Singhal, Anca Dragan
Published	2025-03-13 17:35:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
