The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

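Summary

This work identifies the energy loss phenomenon in reinforcement learning from human feedback (RLHF) and its connection to reward hacking.
Specifically, the energy loss in the final layer of a large language model (LLM) gradually increases during the RL process, and an excessive increase in energy loss characterizes reward hacking.
Beyond empirical analysis, we provide a theoretical foundation by proving that, under mild conditions, increased energy loss lowers the upper bound on contextual relevance in LLMs.
This is a critical aspect of reward hacking, because reduced contextual relevance typically indicates overfitting to patterns favored by the reward model during RL.
To address this issue, we propose an energy loss-aware PPO algorithm (EPPO), which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss and thereby mitigate reward hacking.
We show theoretically that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which offers deeper insight into its effectiveness.
Extensive experiments across various LLMs and tasks demonstrate how common the energy loss phenomenon is and how effective EPPO is at mitigating reward hacking and improving RLHF performance.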
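
The reward penalty described above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: energy loss is taken to be the L1 norm of the final layer's input hidden state minus the L1 norm of its output, the baseline is a hypothetical reference model (for example, the model before RL), the penalty is one-sided, and `eta` is a made-up coefficient. The function names are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def energy_loss(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Assumed definition: L1 norm of the layer input minus L1 norm of the layer output."""
    return float(np.abs(h_in).sum() - np.abs(h_out).sum())


def penalized_reward(reward: float,
                     loss_policy: float,
                     loss_reference: float,
                     eta: float = 0.1) -> float:
    """Subtract a penalty proportional to the increase in energy loss.

    loss_policy    -- energy loss of the model being trained (assumed input)
    loss_reference -- energy loss of a reference model on the same response (assumed input)
    eta            -- hypothetical penalty coefficient
    """
    # Only an increase over the reference is penalized (an assumption of this sketch).
    increase = max(0.0, loss_policy - loss_reference)
    return reward - eta * increase


# Toy hidden states standing in for the final layer's input and output.
h_in = np.array([1.0, -2.0, 3.0])
h_out = np.array([0.5, -1.0, 1.5])
loss_now = energy_loss(h_in, h_out)
shaped = penalized_reward(reward=1.0, loss_policy=loss_now, loss_reference=1.0)
```

In a PPO loop, the shaped value would replace the raw reward model score before advantages are computed; whether the paper clips, normalizes, or uses an absolute difference instead is not stated in the abstract.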

Summary (original)

This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM’s final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.

arXiv information

Authors	Yuchun Miao, Sen Zhang, Liang Ding, Yuqi Zhang, Lefei Zhang, Dacheng Tao
Published	2025-01-31 18:10:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
