HAF-RM: A Hybrid Alignment Framework for Reward Model Training

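Summary

Reward models are increasingly important for the alignment, evaluation, and data construction of large language models (LLMs).
Most existing work follows the conventional training framework, which directly optimizes the predicted reward, and concentrates on strengthening reward models through better data.
This paper proposes HaF-RM, a hybrid alignment framework for reward model training that adds a constraint on token-level policy probabilities alongside the reward score.
The framework supervises the internal preference model at the token level while simultaneously optimizing the reward model's mapping layer at the sequence level.
Experiments on five datasets support the validity and effectiveness of the proposed hybrid framework for training high-quality reward models.
By decoupling the reward modeling procedure and incorporating hybrid supervision, HaF-RM offers a principled and effective way to improve the performance and alignment of reward models, a critical component in the responsible development of powerful language models.
The code is released at https://haf-rm.github.io.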

Abstract (original)

The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing researchers focus on enhancing reward models through data improvements, following the conventional training framework for reward models that directly optimizes the predicted rewards. In this paper, we propose a hybrid alignment framework HaF-RM for reward model training by introducing an additional constraint on token-level policy probabilities in addition to the reward score. It can simultaneously supervise the internal preference model at the token level and optimize the mapping layer of the reward model at the sequence level. Experiment results on five datasets sufficiently show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our HaF-RM framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at https://haf-rm.github.io.
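
The abstract does not give the loss functions, so the following is only a minimal sketch of what "hybrid" supervision could look like. It assumes a pairwise Bradley-Terry loss on the sequence-level reward and a DPO-style loss on summed token log-probabilities, mixed by a hypothetical weight `alpha`. All function names, `beta`, and `alpha` are illustrative assumptions, not the authors' implementation.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Sequence-level pairwise loss on scalar rewards (assumed Bradley-Terry form)."""
    return -log_sigmoid(r_chosen - r_rejected)

def policy_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Token-level term (assumed DPO-style): each argument is a list of
    per-token log-probabilities for one response; `ref_*` come from a
    frozen reference model."""
    policy_margin = sum(logp_chosen) - sum(logp_rejected)
    ref_margin = sum(ref_chosen) - sum(ref_rejected)
    return -log_sigmoid(beta * (policy_margin - ref_margin))

def hybrid_loss(r_chosen, r_rejected,
                logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                alpha=0.5, beta=0.1):
    """Weighted mix of the two terms; `alpha` is a hypothetical mixing weight."""
    seq_term = reward_loss(r_chosen, r_rejected)
    tok_term = policy_loss(logp_chosen, logp_rejected,
                           ref_chosen, ref_rejected, beta)
    return alpha * seq_term + (1.0 - alpha) * tok_term
```

In this sketch both terms shrink when the preferred response is scored higher (by reward or by relative log-probability) than the rejected one, which is one way a single shared backbone could receive both sequence-level and token-level signals.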

arXiv information

Authors	Shujun Liu,Xiaoyu Shen,Yuhang Lai,Siyuan Wang,Shengbin Yue,Zengfeng Huang,Xuanjing Huang,Zhongyu Wei
Published	2025-01-08 17:11:53+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
