Energy-Based Reward Models for Robust Language Model Alignment

Summary

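Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences.
However, they often struggle to capture complex human preferences and to generalize to unseen data.
To address these challenges, we introduce the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that improves RM robustness and generalization.
EBRM explicitly models the reward distribution, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations.
It does so through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization.
Notably, EBRM enhances RMs without retraining them, which makes it computationally efficient and adaptable across different models and tasks.
Empirical evaluations on RM benchmarks show significant improvements in both robustness and generalization, with up to a 5.97% improvement on safety-critical alignment tasks compared to standard RMs.
Furthermore, reinforcement learning experiments confirm that the refined rewards improve alignment quality and effectively delay reward hacking.
These results establish the approach as a scalable and effective enhancement for existing RMs and alignment pipelines.
The code is available at EBRM.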

Summary (original)

Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.
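
The abstract names conflict-aware data filtering but does not define it. As a rough illustration only, one plausible reading is that preference pairs are dropped when the base RM's scores disagree with the human label. The sketch below assumes that reading; `PreferencePair`, `filter_conflicting_pairs`, `reward_fn`, and `margin` are hypothetical names introduced here, not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One human-labeled comparison (hypothetical container for this sketch)."""
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator did not prefer


def filter_conflicting_pairs(
    pairs: List[PreferencePair],
    reward_fn: Callable[[str, str], float],
    margin: float = 0.0,
) -> List[PreferencePair]:
    """Keep only pairs where the base RM agrees with the human label.

    Assumption: a pair "conflicts" when the base RM does not score the
    chosen response above the rejected one by more than `margin`.
    """
    kept = []
    for pair in pairs:
        r_chosen = reward_fn(pair.prompt, pair.chosen)
        r_rejected = reward_fn(pair.prompt, pair.rejected)
        if r_chosen - r_rejected > margin:
            kept.append(pair)
    return kept
```

Here `reward_fn` stands in for a frozen base RM that maps a prompt and a response to a scalar. The `margin` parameter is an extra knob added for this sketch; with the default of 0.0, any pair the base RM ranks in the same order as the annotator is kept.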

arXiv information

Authors	Anamika Lochab, Ruqi Zhang
Published	2025-04-17 17:47:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
