ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference

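Summary

Language model alignment is a cutting-edge technique in large language model training that aligns the model's output with the user's intent, for example by making it helpful and harmless.
Recent alignment frameworks consist of two steps: supervised fine-tuning on demonstration data, followed by preference learning on human preference data.
Earlier preference learning methods, such as RLHF and DPO, focus mainly on pair-wise preference data.
However, in many real-world scenarios human feedback is intrinsically point-wise, and in those cases these methods suffer from information loss or even fail.
To fill this gap, the paper first develops a preference learning method called point-wise DPO that handles point-wise preference data.
It then reveals a connection between supervised fine-tuning and point-wise preference learning, which enables a unified framework for both human demonstration data and point-wise preference data and sheds new light on how preference datasets are constructed.
Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of the proposed methods.
A new dataset with high-quality demonstration samples on harmlessness has been constructed and made publicly available.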

Abstract (original)

Language model alignment is a cutting-edge technique in large language model training to align the model output to user’s intent, e.g., being helpful and harmless. Recent alignment framework consists of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedbacks are intrinsically point-wise, these methods will suffer from information loss or even fail. To fill this gap, in this paper, we first develop a preference learning method called point-wise DPO to tackle point-wise preference data. Further revelation on the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference dataset. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.
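
To make the pair-wise versus point-wise distinction concrete, the sketch below contrasts the standard DPO pair-wise loss with one plausible point-wise variant. The abstract does not give the paper's objective, so the point-wise form here (binary cross-entropy applied to the DPO-style implicit reward) is an assumption for illustration only and may differ from the authors' point-wise DPO. All function names, the `beta` default, and the log-probability inputs are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def pairwise_dpo_loss(logp_w: float, ref_w: float,
                      logp_l: float, ref_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss: needs a (preferred, rejected) response pair."""
    margin = (implicit_reward(logp_w, ref_w, beta)
              - implicit_reward(logp_l, ref_l, beta))
    return -math.log(sigmoid(margin))


def pointwise_loss(logp: float, logp_ref: float, label: float,
                   beta: float = 0.1) -> float:
    """Assumed point-wise variant: cross-entropy between a single label
    (binary, or continuous in [0, 1]) and sigmoid(implicit reward).
    Only one response per prompt is required."""
    p = sigmoid(implicit_reward(logp, logp_ref, beta))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
```

The design point is that the pair-wise loss cannot be evaluated for a prompt that has only one labelled response, whereas the point-wise form consumes each (response, label) sample on its own.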

arXiv information

Authors	Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, Guannan Zhang
Published	2023-12-05 07:52:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
