Reward Model Overoptimisation in Iterated RLHF

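Summary

Reinforcement learning from human feedback (RLHF) is widely used to align large language models with human preferences.
However, RLHF often suffers from reward model overoptimisation: the model overfits to the reward function, yielding policies that do not generalise and instead exploit the reward function's idiosyncrasies and peculiarities.
A common mitigation is iterated RLHF, in which the reward model is repeatedly retrained on updated human feedback and the policy is re-optimised.
Despite its growing adoption, the dynamics of overoptimisation in this setting are poorly understood.
In this work, we present the first comprehensive study of overoptimisation in iterated RLHF.
We systematically analyse key design choices: how reward model training data is transferred across iterations, which reward function is used for optimisation, and how the policy is initialised.
Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as the reward models increasingly approximate ground-truth preferences.
However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility.
Other initialisation strategies often fail to recover from early overoptimisation.
These findings offer actionable insights for building more stable and generalisable RLHF pipelines.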

Summary (original)

Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices – how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.
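The three design choices named in the abstract (how preference data is carried across iterations, which reward function drives optimisation, and how the policy is initialised) can be pictured as switches on a single loop. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the function `iterated_rlhf`, its option names (`"concat"`, `"latest"`, `"ensemble"`, `"base"`, `"previous"`), and the toy types (a policy as a single float, a reward model as a plain callable) are all hypothetical.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Toy types (assumptions, not from the paper): a policy is one float, a
# preference record is a float, and a reward model maps a float to a score.
Policy = float
RewardFn = Callable[[float], float]


def iterated_rlhf(
    base_policy: Policy,
    collect_preferences: Callable[[Policy], List[float]],
    train_reward_model: Callable[[List[float]], RewardFn],
    optimise_policy: Callable[[Policy, RewardFn], Policy],
    n_iterations: int,
    data_transfer: str = "concat",  # "concat": pool all data; "latest": newest only
    reward_choice: str = "latest",  # "latest" model, or "ensemble" mean of all models
    policy_init: str = "base",      # "base": restart each round; "previous": continue
) -> Tuple[Policy, List[Policy]]:
    """Run a schematic iterated-RLHF loop; return the final policy and history."""
    if data_transfer not in ("concat", "latest"):
        raise ValueError(f"unknown data_transfer: {data_transfer}")
    if reward_choice not in ("latest", "ensemble"):
        raise ValueError(f"unknown reward_choice: {reward_choice}")
    if policy_init not in ("base", "previous"):
        raise ValueError(f"unknown policy_init: {policy_init}")

    policy = base_policy
    pooled: List[float] = []       # preference data available to the reward model
    models: List[RewardFn] = []    # one reward model per iteration
    history: List[Policy] = []

    for _ in range(n_iterations):
        # 1. Gather fresh feedback on samples from the current policy.
        fresh = collect_preferences(policy)

        # 2. Design choice: how training data is transferred across iterations.
        pooled = pooled + fresh if data_transfer == "concat" else list(fresh)

        # 3. Retrain the reward model on the chosen data.
        models.append(train_reward_model(pooled))

        # 4. Design choice: which reward function is optimised against.
        if reward_choice == "latest":
            reward_fn = models[-1]
        else:
            snapshot = list(models)
            reward_fn = lambda x, ms=snapshot: mean(m(x) for m in ms)

        # 5. Design choice: where policy optimisation starts from.
        start = base_policy if policy_init == "base" else policy
        policy = optimise_policy(start, reward_fn)
        history.append(policy)

    return policy, history
```

In a real pipeline the three callables would wrap preference collection, reward model training, and a policy optimiser; here they are left abstract so that only the control flow of the iteration is shown.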

arXiv information

Authors	Lorenz Wolf, Robert Kirk, Mirco Musolesi
Published	2025-05-23 17:36:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
