Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

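Summary

Reinforcement learning from human feedback (RLHF) is widely used to align language models (LMs) with human preferences.
Earlier RLHF work usually adopts a bandit formulation, which is intuitive but ignores the sequential nature of LM generation and can suffer from sparse rewards.
Recent studies propose dense token-level RLHF, but treating every token as an action may be too fine-grained for proper reward assignment.
This paper aims to get the best of both approaches by training and using a segment-level reward model, which assigns a reward to each semantically complete text segment spanning a short sequence of tokens.
For reward learning, the method supports dynamic text segmentation and is compatible with standard sequence-preference datasets.
For effective RL-based LM training against segment rewards, the authors generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment rewards for further densification.
With these designs, the method performs competitively on three popular RLHF benchmarks for LM policies: AlpacaEval 2.0, Arena-Hard, and MT-Bench.
Ablation studies are conducted to further examine the method.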

Abstract (original)

Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.
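
The abstract mentions interpolating segment rewards "for further densification" but does not say how. As a rough illustration only, the sketch below assumes one simple scheme: each segment's reward is split evenly across the tokens of that segment, so every token receives a signal while the total reward per segment is unchanged. The function name `densify_segment_rewards` and the even-split rule are hypothetical choices for this example, not the authors' stated implementation.

```python
from typing import List


def densify_segment_rewards(
    segment_lengths: List[int],
    segment_rewards: List[float],
) -> List[float]:
    """Spread each segment's reward evenly over that segment's tokens.

    Assumption (not from the paper): an even split is used as the
    interpolation rule. The sum of token rewards inside a segment
    equals the original segment reward.
    """
    if len(segment_lengths) != len(segment_rewards):
        raise ValueError("need exactly one reward per segment")

    token_rewards: List[float] = []
    for length, reward in zip(segment_lengths, segment_rewards):
        if length <= 0:
            raise ValueError("segment lengths must be positive")
        # Every token in the segment gets an equal share of the reward.
        token_rewards.extend([reward / length] * length)
    return token_rewards


# Hypothetical response of 6 tokens, split into segments of 2 and 4 tokens.
per_token = densify_segment_rewards([2, 4], [1.0, -2.0])
```

By contrast, a bandit formulation would place a single scalar reward on the final token and leave all other positions at zero, which is the sparse-reward setting the abstract describes.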

arXiv information

Authors	Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
Published	2025-01-06 06:17:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
