The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

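Summary

This study identifies the energy loss phenomenon in reinforcement learning from human feedback (RLHF) and shows how it relates to reward hacking. Specifically, the energy loss in the final layer of a large language model (LLM) increases gradually during the RL process, and an excessive increase is characteristic of reward hacking. Beyond the empirical analysis, we give a theoretical basis by proving that, under mild conditions, a larger energy loss lowers the upper bound on the LLM's contextual relevance. This matters for reward hacking because reduced contextual relevance typically indicates overfitting to patterns that the reward model favors. To address the problem, we propose an energy loss-aware PPO algorithm (EPPO), which penalizes increases in final-layer energy loss during reward calculation so that the energy loss does not grow excessively, thereby mitigating reward hacking. We also show theoretically that EPPO can be interpreted conceptually as an entropy-regularized RL algorithm, which helps explain why it is effective. Extensive experiments across a range of LLMs and tasks demonstrate that the energy loss phenomenon is common and that EPPO mitigates reward hacking and improves RLHF performance.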

Summary (original)

This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM’s final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
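The penalty described above can be illustrated with a minimal sketch. The abstract does not define energy loss or give the penalty formula, so everything here is an assumption made for illustration: energy loss is taken to be the L1 norm of the final layer's input hidden state minus the L1 norm of its output, the penalty is applied only to increases relative to a reference model, and the names `energy_loss`, `penalized_reward`, and `eta` are hypothetical.

```python
def l1_norm(vec):
    """Sum of absolute values of a hidden-state vector."""
    return sum(abs(x) for x in vec)


def energy_loss(h_in, h_out):
    """ASSUMED definition (not stated in the abstract):
    L1 norm of the layer input minus L1 norm of the layer output."""
    return l1_norm(h_in) - l1_norm(h_out)


def penalized_reward(reward, policy_energy_loss, reference_energy_loss, eta=0.1):
    """Hypothetical reward shaping: subtract a penalty proportional to how much
    the policy's final-layer energy loss exceeds that of a reference model.
    `eta` is an assumed penalty coefficient."""
    increase = max(0.0, policy_energy_loss - reference_energy_loss)
    return reward - eta * increase


# Example: the policy loses more "energy" in its final layer than the reference.
policy_loss = energy_loss([1.0, -2.0, 3.0], [0.5, -1.0, 1.5])
shaped = penalized_reward(1.0, policy_loss, reference_energy_loss=1.0, eta=0.25)
```

In a real RLHF pipeline the hidden states would come from the model's final transformer layer for each response, and the shaped reward would replace the raw reward-model score passed to PPO. Whether the published method clips at zero or uses a different measure of deviation is not stated in the abstract.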

arXiv information

Authors	Yuchun Miao, Sen Zhang, Liang Ding, Yuqi Zhang, Lefei Zhang, Dacheng Tao
Published	2025-02-04 16:22:43+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
