Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

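Summary

After pre-training, large language models are aligned with human preferences based on pairwise comparisons.
State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, even though they are deployed in settings where users have diverse preferences.
As a result, it is not even clear that these alignment methods produce models that satisfy users on average.
Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, the authors introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy.
The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$).
By contrast, RLHF and DPO already suffer $\geq (1 - o(1)) \cdot \beta$ distortion without a KL constraint.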

Summary (original)

After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users’ comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method’s distortion: the worst-case ratio between the optimal achievable average utility, and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
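To make the ratio inside the distortion definition concrete, here is a minimal Python sketch for a single toy instance. It is not the authors' code. It assumes, for illustration only, that each user has utilities in $[0, 1]$, that a user prefers alternative $a$ over $b$ with probability $\sigma(\beta (u_a - u_b))$ (one common Bradley-Terry parameterization), and that there is no KL constraint, so the best achievable average utility is attained by a single alternative. All function names and the example numbers are hypothetical. Distortion itself is the worst case of this ratio over instances, which the sketch does not attempt to compute.

```python
import math


def bt_preference_prob(u_a, u_b, beta):
    """Assumed Bradley-Terry model: probability that a user with
    utilities u_a and u_b reports preferring a over b."""
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))


def average_utility(utilities, policy):
    """Average over users of the expected utility under `policy`.

    utilities: one list per user, one utility per alternative.
    policy: probability distribution over alternatives.
    """
    per_user = [sum(p * u for p, u in zip(policy, user)) for user in utilities]
    return sum(per_user) / len(per_user)


def utility_ratio(utilities, policy):
    """Best achievable average utility divided by the policy's average utility.

    Without a KL constraint the objective is linear in the policy,
    so it suffices to check deterministic policies.
    """
    m = len(utilities[0])
    best = max(
        average_utility(utilities, [1.0 if j == k else 0.0 for j in range(m)])
        for k in range(m)
    )
    return best / average_utility(utilities, policy)


def population_win_rate(utilities, a, b, beta):
    """Fraction of pairwise comparisons in which a beats b, averaged over users."""
    return sum(bt_preference_prob(u[a], u[b], beta) for u in utilities) / len(utilities)


# Hypothetical instance: one user strongly prefers alternative 0,
# two users mildly prefer alternative 1.
utilities = [[1.0, 0.0], [0.4, 0.5], [0.4, 0.5]]
beta = 50.0

win_rate_a = population_win_rate(utilities, 0, 1, beta)
ratio_b = utility_ratio(utilities, [0.0, 1.0])

print(f"win rate of alternative 0 over 1: {win_rate_a:.3f}")
print(f"utility ratio of always choosing alternative 1: {ratio_b:.3f}")
```

In this toy instance alternative 0 has the higher average utility (0.6 versus about 0.33), yet it loses most pairwise comparisons because two of the three users mildly prefer alternative 1. A policy that always returns alternative 1 therefore has a utility ratio of about 1.8, which illustrates how comparison data can diverge from average utility.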

arXiv information

Authors	Paul Gölz, Nika Haghtalab, Kunhe Yang
Published	2025-05-29 17:59:20+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
