RRM: Robust Reward Model Training Mitigates Reward Hacking

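Summary

Reward models (RMs) are central to aligning large language models (LLMs) with human preferences.
However, conventional RM training relies on response pairs tied to specific prompts, and so struggles to separate prompt-driven preferences from prompt-independent artifacts such as response length and format.
This work exposes a fundamental limitation of current RM training methods: when determining preferences, RMs cannot effectively distinguish contextual signals from irrelevant artifacts.
To address this, we introduce a causal framework that learns preferences independent of these artifacts, and we propose a new data augmentation technique designed to eliminate them.
Extensive experiments show that our approach successfully removes undesirable artifacts and yields a more robust reward model (RRM).
On RewardBench, the RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, raising accuracy from 80.61% to 84.15%.
We also train two DPO policies, one with the RM and one with the RRM, and show that the RRM substantially improves the DPO-aligned policy: the MT-Bench score rises from 7.27 to 8.31, and the length-controlled win rate on AlpacaEval-2 rises from 33.46% to 52.49%.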

Summary (original)

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on RewardBench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.
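
To put the three reported improvements on a common footing, the short sketch below computes the absolute and relative gain for each before/after pair. The numbers are copied from the abstract; the dictionary layout and the helper name `gains` are illustrative assumptions and are not taken from the paper.

```python
# Before/after figures as reported in the abstract (RM baseline -> RRM).
reported = {
    "RewardBench accuracy (%)": (80.61, 84.15),
    "MT-Bench score": (7.27, 8.31),
    "AlpacaEval-2 LC win rate (%)": (33.46, 52.49),
}


def gains(before, after):
    """Return (absolute gain, relative gain in percent) for one metric.

    Hypothetical helper for illustration only.
    """
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


summary = {name: gains(b, a) for name, (b, a) in reported.items()}

for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain} absolute, +{rel_gain}% relative")
```

Read this way, the RewardBench gain is about 3.5 accuracy points, while the downstream policy metrics move by a much larger relative margin, which is in line with the abstract's claim that the more robust reward model mainly pays off in the DPO-aligned policies.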

arXiv information

Authors	Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasiia Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
Published	2025-02-27 16:30:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
