R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

Summary

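Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
It begins by training a reward model on pairwise human feedback.
The reward model is then used in reinforcement learning to score each generated sentence as a whole, which guides further optimization of the LLM.
However, current approaches have a significant shortcoming: \emph{they assign a single, sparse, and delayed reward to the entire output sequence}.
This can overlook the individual contribution that each token makes toward the desired outcome.
To overcome this limitation, our paper proposes a novel reward redistribution method called R3HF, which enables finer-grained, token-level reward allocation.
Specifically, our method treats the reward model's reward prediction task as a regression problem.
The redistributed rewards are then computed by evaluating each token's specific contribution to the reward model's output.
This finer-grained approach improves the model's grasp of language nuances, leading to more precise performance gains.
Our method is designed to integrate seamlessly with most current techniques while incurring minimal computational cost.
Through comprehensive experiments across diverse datasets and tasks, we verified the effectiveness and superiority of our approach.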
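
The summary does not state the exact redistribution formula. As a minimal sketch of the general idea, assume that a sequence-level reward model can also score partial outputs, and credit each token with the change in score it causes when appended to the prefix. This prefix-difference rule is an illustrative assumption, not necessarily the paper's method, and `redistribute_reward`, `score_prefix`, and `toy_score` are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Sequence


def redistribute_reward(
    tokens: Sequence[str],
    score_prefix: Callable[[Sequence[str]], float],
) -> List[float]:
    """Credit each token with the change in reward-model score it causes.

    Assumption: score_prefix can evaluate partial sequences, including
    the empty prefix, which serves as the baseline.
    """
    rewards: List[float] = []
    previous = score_prefix(tokens[:0])  # baseline score of the empty prefix
    for t in range(1, len(tokens) + 1):
        current = score_prefix(tokens[:t])
        rewards.append(current - previous)  # contribution of token t-1
        previous = current
    return rewards


def toy_score(prefix: Sequence[str]) -> float:
    """Hypothetical stand-in for a reward model: +1 per 'good', -1 per 'bad'."""
    weights = {"good": 1.0, "bad": -1.0}
    return float(sum(weights.get(tok, 0.0) for tok in prefix))


tokens = ["good", "but", "bad", "then", "good"]
token_rewards = redistribute_reward(tokens, toy_score)
print(token_rewards)
```

Because the differences telescope, the token-level rewards sum to the full-sequence score minus the empty-prefix score, so the total reward is preserved while the credit is spread across individual tokens.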

Summary (original)

Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. This involves the initial training of a reward model based on pairwise human feedback. The reward model is subsequently utilized in reinforcement learning to assess the scores of each generated sentence as a whole, further guiding the optimization of LLMs. However, current approaches have a significant shortcoming: \emph{They allocate a single, sparse, and delayed reward to an entire sequence of output}. This may overlook some significant individual contributions of each token towards the desired outcome. To overcome this limitation, our paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation. Specifically, our method treats the reward prediction task of the reward model as a regression problem. As a result, the redistributed rewards are computed by evaluating the specific contribution of each token to the reward model’s output. This detailed approach improves the model’s understanding of language nuances, leading to more precise enhancements in its performance. Our method is crafted to integrate seamlessly with most current techniques while incurring minimal computational costs. Through comprehensive experiments across diverse datasets and tasks, we have verified the effectiveness and superiority of our approach.

arXiv information

Authors	Jiahui Li, Tai-wei Chang, Fengda Zhang, Kun Kuang, Long Chen
Published	2024-11-13 02:45:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
