Information-Theoretic Reward Decomposition for Generalizable RLHF

Summary

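A generalizable reward model is essential in Reinforcement Learning from Human Feedback (RLHF) because it can correctly evaluate prompt-response pairs it has not seen before.
Existing reward models lack this ability: they are typically trained to widen the reward gap between chosen and rejected responses, while overlooking the prompts those responses are conditioned on.
As a result, when a trained reward model is evaluated on prompt-response pairs outside the data distribution, neglecting the effect of the prompt can lead to poor generalization.
To address this, the authors decompose the reward value into two independent components: a prompt-free reward and a prompt-related reward.
The prompt-free reward is the part of the evaluation determined by the response alone, while the prompt-related reward is the part that derives from both the prompt and the response.
The two components are extracted from an information-theoretic perspective, which requires no additional models.
The authors then propose a new reward learning algorithm that prioritizes data samples according to their prompt-free reward values.
Toy examples show that the extracted prompt-free and prompt-related rewards effectively characterize the two parts of the reward model.
Standard evaluations further show that the method improves both the alignment performance and the generalization capability of the reward model.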
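
The summary says reward models are usually trained by widening the reward gap between chosen and rejected responses. A common way to do this is a Bradley-Terry style pairwise loss; the abstract does not name the exact loss, so the function below is an assumed, minimal sketch rather than the authors' training code. It illustrates that the loss sees only the gap between two scalar rewards, so nothing in the objective itself forces the model to use the prompt.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred.

    Assumed form: -log(sigmoid(r_chosen - r_rejected)), written as a
    numerically stable softplus of the negated gap.
    """
    gap = r_chosen - r_rejected
    # softplus(-gap) = max(-gap, 0) + log(1 + exp(-|gap|))
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


# The loss depends only on the gap, not on the absolute reward values
# or on how those values were produced from the prompt.
loss_a = bradley_terry_loss(2.0, 1.0)
loss_b = bradley_terry_loss(5.0, 4.0)
print(abs(loss_a - loss_b) < 1e-12)
```

Because only the gap matters, a model that scores responses while ignoring the prompt can still drive this loss down on the training pairs, which is the failure mode the summary describes.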

Abstract (original)

A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts may result in poor generalization of the reward model. To address this issue, we decompose the reward value into two independent components: prompt-free reward and prompt-related reward. Prompt-free reward represents the evaluation that is determined only by responses, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
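
To make the idea of splitting a reward into prompt-free and prompt-related parts concrete, here is a toy sketch. It is not the paper's information-theoretic extraction, which the abstract does not spell out. It swaps in a much simpler stand-in: the prompt-free part is assumed to be the reward averaged over a pool of prompts, and the prompt-related part is the remainder. The names `toy_reward`, `prompt_free_reward`, `decompose`, and `prioritize` are hypothetical, and the sort direction (smallest prompt-free gap first) is also an assumption, since the abstract only says samples are prioritized by their prompt-free reward values.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

RewardFn = Callable[[str, str], float]
Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward: a length bonus plus an on-topic bonus."""
    length_bonus = 0.1 * len(response.split())  # ignores the prompt
    topic = prompt.split()[0].lower()
    on_topic = 1.0 if topic in response.lower() else 0.0  # uses the prompt
    return length_bonus + on_topic


def prompt_free_reward(reward: RewardFn, response: str,
                       prompts: Sequence[str]) -> float:
    """Assumed stand-in: average the reward over a pool of prompts."""
    return mean(reward(p, response) for p in prompts)


def decompose(reward: RewardFn, prompt: str, response: str,
              prompts: Sequence[str]) -> Tuple[float, float]:
    """Split a reward into (prompt_free, prompt_related) parts."""
    free = prompt_free_reward(reward, response, prompts)
    return free, reward(prompt, response) - free


def prioritize(reward: RewardFn, pairs: Sequence[Pair],
               prompts: Sequence[str]) -> List[Pair]:
    """Order pairs by prompt-free gap, smallest first (assumed direction)."""
    def free_gap(pair: Pair) -> float:
        _, chosen, rejected = pair
        return (prompt_free_reward(reward, chosen, prompts)
                - prompt_free_reward(reward, rejected, prompts))
    return sorted(pairs, key=free_gap)


prompt_pool = ["cats are", "dogs are"]
free, related = decompose(toy_reward, "cats are", "cats purr", prompt_pool)
print(free, related)
```

Under these assumptions, pairs whose preference cannot be explained by the response alone come first, so training on them first would push the model to rely on the prompt.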

arXiv information

Authors	Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai
Published	2025-04-08 13:26:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
