Learning a Canonical Basis of Human Preferences from Binary Ratings

Summary

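Recent progress in generative AI has been driven by alignment techniques such as reinforcement learning from human feedback (RLHF).
RLHF and related methods typically build a dataset of binary or ranked human preference choices and then fine-tune models to match those preferences.
This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences.
A small subset of 21 preference categories (selected from a set of nearly 5,000 distinct preferences) captures more than 89% of the preference variation across individuals.
This small set of preferences is analogous to a canonical basis of human preferences, much like established findings that characterize human variation in psychology or facial recognition research.
Both synthetic and empirical evaluations confirm that this low-rank canonical set of human preferences generalizes across the entire dataset and within specific topics.
The paper further demonstrates the utility of the preference basis in model evaluation, where the preference categories offer deeper insight into model alignment, and in model training, where fine-tuning on preference-defined subsets successfully aligns the model accordingly.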

Summary (original)

Recent advances in generative AI have been driven by alignment techniques such as reinforcement learning from human feedback (RLHF). RLHF and related techniques typically involve constructing a dataset of binary or ranked choice human preferences and subsequently fine-tuning models to align with these preferences. This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences. We find that a small subset of 21 preference categories (selected from a set of nearly 5,000 distinct preferences) captures >89% of preference variation across individuals. This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies. Through both synthetic and empirical evaluations, we confirm that our low-rank, canonical set of human preferences generalizes across the entire dataset and within specific topics. We further demonstrate our preference basis’ utility in model evaluation, where our preference categories offer deeper insights into model alignment, and in model training, where we show that fine-tuning on preference-defined subsets successfully aligns the model accordingly.

arXiv information

Authors	Kailas Vodrahalli, Wei Wei, James Zou
Published	2025-03-31 14:35:48+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
