Rethinking Diverse Human Preference Learning through Principal Component Analysis

Abstract

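Understanding human preferences is important for improving foundation models and building personalized AI systems.
However, preferences are inherently diverse and complex, which makes it difficult for traditional reward models to capture their full range.
Fine-grained preference data helps, but it is expensive to collect and hard to scale.
This paper introduces Decomposed Reward Models (DRMs), a new approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations.
The key insight is to represent human preferences as vectors and analyze them with Principal Component Analysis (PCA).
By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference.
These decomposed rewards can be flexibly combined to match different user needs, offering an interpretable and scalable alternative to traditional reward models.
The authors demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
The results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.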
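
The decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `decompose_preferences` and `combined_reward`, the centering step, the sign convention for each component, and the hand-picked user weights are all assumptions made for illustration, and the embeddings are random placeholders rather than outputs of a real model.

```python
import numpy as np


def decompose_preferences(chosen_emb, rejected_emb, n_components):
    """Run PCA on embedding differences (chosen minus rejected).

    Returns (components, explained_variance). `components` has shape
    (n_components, dim) and its rows are orthonormal basis vectors.
    """
    diffs = chosen_emb - rejected_emb
    # Assumption: center the differences, as in textbook PCA.
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Assumption: PCA signs are arbitrary, so orient each direction to
    # agree with the average preference direction.
    signs = np.sign(components @ diffs.mean(axis=0))
    signs[signs == 0] = 1.0
    components = components * signs[:, None]
    explained_variance = (s[:n_components] ** 2) / (len(diffs) - 1)
    return components, explained_variance


def combined_reward(embedding, components, weights):
    """Score one response embedding as a weighted sum of decomposed rewards."""
    return float(weights @ (components @ embedding))


# Placeholder data: 200 preference pairs with 16-dimensional embeddings.
rng = np.random.default_rng(0)
chosen = rng.normal(size=(200, 16))
rejected = rng.normal(size=(200, 16))

components, variance = decompose_preferences(chosen, rejected, n_components=4)

# Hypothetical user weights over the four decomposed rewards.
user_weights = np.array([0.6, 0.2, 0.1, 0.1])
score = combined_reward(chosen[0], components, user_weights)
print(components.shape, score)
```

Because the combined reward is linear in the embedding, changing `user_weights` re-targets the reward to a different user without retraining the basis vectors, which is the property the abstract refers to as adapting to new users without additional training.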

Abstract (original)

Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.

arXiv information

Authors	Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
Published	2025-02-18 18:55:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
