West-of-N: Synthetic Preferences for Self-Improving Reward Models

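Summary

The success of reinforcement learning from human feedback (RLHF) for language model alignment depends strongly on the quality of the underlying reward model.
This paper presents a new approach that improves reward model quality by generating synthetic preference data, augmenting the training dataset with on-policy, high-quality preference pairs.
Motivated by the promising results of Best-of-N sampling strategies in language model training, the authors extend their application to reward model training.
The result is a self-training strategy that generates preference pairs by selecting the best and worst candidates from a pool of responses to a given query.
Empirically, the approach improves the performance of any reward model, with an effect comparable to adding a similar quantity of human preference data.
By offering synthetic preference generation as a solution to reward modeling challenges, this work opens new avenues of research for improving RLHF for language model alignment.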

Summary (original)

The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
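
The pair-selection step described in the abstract (choose the best and worst candidates from a pool of responses to a query) can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: `west_of_n_pair`, `sample_response`, `score`, and the toy stand-ins below are hypothetical names introduced here, and a real pipeline would sample from the policy language model and score with the current reward model.

```python
from typing import Callable, List, Tuple


def west_of_n_pair(
    query: str,
    sample_response: Callable[[str], str],   # hypothetical: draws one response from the policy
    score: Callable[[str, str], float],      # hypothetical: reward model score for (query, response)
    n: int = 4,
) -> Tuple[str, str]:
    """Return a synthetic (preferred, rejected) pair from a pool of n sampled responses."""
    if n < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    # Build the candidate pool by sampling the policy n times for the same query.
    pool: List[str] = [sample_response(query) for _ in range(n)]
    # Rank candidates by reward score, lowest first.
    ranked = sorted(pool, key=lambda response: score(query, response))
    # Best-scoring candidate becomes "preferred", worst-scoring becomes "rejected".
    return ranked[-1], ranked[0]


# Toy stand-ins (assumptions for illustration only): a canned "policy" and a
# "reward model" that simply prefers longer responses.
_canned = iter(["ok", "a longer answer", "hm", "a much longer, detailed answer"])


def toy_policy(query: str) -> str:
    return next(_canned)


def toy_reward(query: str, response: str) -> float:
    return float(len(response))


preferred, rejected = west_of_n_pair("What is RLHF?", toy_policy, toy_reward, n=4)
print(preferred, "|", rejected)
```

Pairs produced this way would be added to the reward model's training set alongside the human-labeled preferences; how many candidates to sample and whether to filter low-confidence pairs are design choices this sketch leaves open.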

arXiv information

Authors	Alizée Pace,Jonathan Mallinson,Eric Malmi,Sebastian Krause,Aliaksei Severyn
Published	2024-10-25 12:04:26+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
