Optimal Design for Reward Modeling in RLHF

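Summary

Reinforcement learning from human feedback (RLHF) has become a common approach for aligning language models (LMs) with human preferences.
The method involves collecting a large dataset of human pairwise preferences over a variety of text generations and using it to infer a reward model, either implicitly or explicitly.
Many methods have been proposed for learning the reward model and aligning an LM with it.
The costly process of collecting human preferences, however, has received little attention and could benefit from theoretical insight.
This paper addresses that issue and aims to formalize the reward training model in RLHF.
Using a linear contextual dueling bandit method, we frame the selection of an effective dataset as a simple regret minimization task.
Because the number of arms can be very large, this framing is more coherent than the best-arm identification setting.
We then propose an offline framework for solving this problem.
Under appropriate assumptions (linearity of the reward model in the embedding space and boundedness of the reward parameter), we derive bounds on the simple regret.
Finally, we provide a lower bound that matches the upper bound up to constant and logarithmic terms.
To our knowledge, this is the first theoretical contribution in this area to offer both an offline approach and worst-case guarantees.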

Abstract (original)

Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align a LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize the reward training model in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline framework for solving this problem. Under appropriate assumptions – linearity of the reward model in the embedding space, and boundedness of the reward parameter – we derive bounds on the simple regret. Finally, we provide a lower bound that matches our upper bound up to constant and logarithmic terms. To our knowledge, this is the first theoretical contribution in this area to provide an offline approach as well as worst-case guarantees.
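
To make the dataset-selection idea more concrete, here is a minimal sketch of one way to choose which pairs of generations to send for human comparison. It is not the paper's algorithm: it assumes a reward that is linear in a fixed embedding, represents each candidate pair by the difference of its two embeddings, and applies a standard greedy D-optimal design heuristic. The function name `greedy_d_optimal_pairs` and the `ridge` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def greedy_d_optimal_pairs(embeddings, n_pairs, ridge=1e-3):
    """Greedily pick pairs (i, j) to compare (illustrative heuristic only).

    Assumption: the reward is linear in the embedding, so a comparison of
    items i and j is informative along the direction emb[i] - emb[j].
    Each step adds the pair that most increases log det of the
    regularized design matrix V = ridge * I + sum_k z_k z_k^T.
    Pairs may be selected more than once.
    """
    n, d = embeddings.shape
    # Enumerate all candidate pairs i < j and their difference features.
    idx_i, idx_j = np.triu_indices(n, k=1)
    diffs = embeddings[idx_i] - embeddings[idx_j]

    v_inv = np.eye(d) / ridge  # inverse of the initial design matrix
    chosen = []
    for _ in range(n_pairs):
        # Matrix determinant lemma: adding z multiplies det(V) by
        # 1 + z^T V^{-1} z, so maximizing the quadratic form maximizes the gain.
        gains = np.einsum("nd,de,ne->n", diffs, v_inv, diffs)
        k = int(np.argmax(gains))
        z = diffs[k]
        chosen.append((int(idx_i[k]), int(idx_j[k])))
        # Sherman-Morrison rank-one update of V^{-1} after adding z z^T.
        vz = v_inv @ z
        v_inv -= np.outer(vz, vz) / (1.0 + z @ vz)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 4))  # 8 hypothetical generations, 4-dim embeddings
    print(greedy_d_optimal_pairs(emb, n_pairs=5))
```

The selection here depends only on the embeddings and is fixed before any preference labels are collected, which is the sense in which such a design is offline. The regret bounds mentioned in the abstract apply to the paper's own procedure, not to this sketch.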

arXiv information

Authors	Antoine Scheid,Etienne Boursier,Alain Durmus,Michael I. Jordan,Pierre Ménard,Eric Moulines,Michal Valko
Published	2024-10-22 14:36:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
