Everyone Deserves A Reward: Learning Customized Human Preferences

Summary

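Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences and thereby improving interaction quality.
However, the real world is pluralistic, so human preferences diverge across religions, politics, cultures, and more. In addition, each individual may hold their own preferences on different topics.
Current human-feedback alignment methods overlook this diversity and consider only a single general reward model, which falls short in customized or personalized application scenarios.
To study customized preference learning, we collect a domain-specific preference (DSP) dataset containing the preferred responses for each query drawn from four practical domains.
From a data-efficiency perspective, we also propose a three-stage customized RM learning scheme and empirically verify its effectiveness on both general preference datasets and the DSP set.
We further test multiple training and data strategies across the three learning stages.
We identify several ways to better preserve general preference ability while training customized RMs, notably general preference enrichment and customized preference imitation learning.
The DSP dataset and code are available at https://github.com/Linear95/DSP.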

Abstract (original)

Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diversified human preferences with respect to different religions, politics, cultures, etc. Moreover, each individual can have their unique preferences on various topics. Neglecting the diversity of human preferences, current human feedback aligning methods only consider a general reward model, which is below satisfaction for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. Besides, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages. We find several ways to better preserve the general preferring ability while training the customized RMs, especially general preference enrichment, and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.

arXiv information

Authors	Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du
Published	2023-09-15 09:24:30+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
