Everyone Deserves A Reward: Learning Customized Human Preferences

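Summary

Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences and thereby improving interaction quality.
However, the real world is pluralistic, so human preferences diverge across religions, political views, cultures, and other factors. In addition, each individual may hold their own preferences on different topics.
Current LLM training pipelines overlook this diversity and rely on a single general reward model, which falls short in customized or personalized application scenarios.
To study customized preference learning, the authors collect a domain-specific preference (DSP) dataset containing the preferred response to each query in four practical domains.
From a data-efficiency perspective, they also propose a three-stage customized RM learning scheme and verify its effectiveness empirically on both general preference datasets and the DSP set.
They further test multiple training and data strategies across the three stages and find several ways to better preserve general preference ability while training customized RMs, notably general preference enrichment and customized preference imitation learning.
The DSP dataset and code are available at https://github.com/Linear95/DSP.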
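
The summary does not state which training objective the reward models use. As a rough illustration of how a reward model can learn from preferred responses in general, the sketch below assumes the widely used pairwise (Bradley-Terry style) ranking loss. It is not taken from the paper or the DSP repository, and the names `pairwise_ranking_loss` and `mean_ranking_loss` are hypothetical.

```python
import math
from typing import Iterable, Tuple


def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)) for one preference pair.

    Assumption: a Bradley-Terry style objective, used here only for illustration.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_ranking_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (preferred, rejected) reward pairs."""
    losses = [pairwise_ranking_loss(p, r) for p, r in pairs]
    if not losses:
        raise ValueError("at least one preference pair is required")
    return sum(losses) / len(losses)
```

The loss shrinks as the preferred response is scored further above the rejected one, and it equals log 2 when both receive the same score. Under this assumption, a customized RM would be trained on pairs drawn from one domain's preferences, while a general RM would use pairs from general preference data.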

Summary (original)

Reward models (RMs) are crucial in aligning large language models (LLMs) with human preferences for improving interaction quality. However, the real world is pluralistic, which leads to diversified human preferences based on different religions, politics, cultures, etc. Moreover, each individual can have their own unique preferences on various topics. Neglecting the diversity of human preferences, current LLM training processes only use a general reward model, which is below satisfaction for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which collects preferred responses to each given query from four practical domains. Besides, from the perspective of data efficiency, we proposed a three-stage customized RM learning scheme, whose effectiveness is empirically verified on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages, and have found several ways to better preserve the general preferring ability while training the customized RMs, especially general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.

arXiv information

Authors	Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du
Published	2023-09-06 16:03:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
