Axioms for AI Alignment from Human Feedback

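Summary

In the context of reinforcement learning from human feedback (RLHF), the reward function is generally derived by maximum likelihood estimation of a random utility model fitted to pairwise comparisons made by humans.
We argue that learning a reward function is a preference aggregation problem, and that it largely falls within the scope of social choice theory.
From this perspective, different aggregation methods can be evaluated against established axioms by checking whether they meet or fail well-known standards.
We demonstrate that both the Bradley-Terry-Luce model and its broad generalizations fail to satisfy basic axioms.
In response, we develop novel rules for learning reward functions that come with strong axiomatic guarantees.
A key innovation from the standpoint of social choice is that our problem has a linear structure, which greatly restricts the space of feasible rules and leads to a new paradigm that we call linear social choice.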

Summary (original)

In the context of reinforcement learning from human feedback (RLHF), the reward function is generally derived from maximum likelihood estimation of a random utility model based on pairwise comparisons made by humans. The problem of learning a reward function is one of preference aggregation that, we argue, largely falls within the scope of social choice theory. From this perspective, we can evaluate different aggregation methods via established axioms, examining whether these methods meet or fail well-known standards. We demonstrate that both the Bradley-Terry-Luce Model and its broad generalizations fail to meet basic axioms. In response, we develop novel rules for learning reward functions with strong axiomatic guarantees. A key innovation from the standpoint of social choice is that our problem has a linear structure, which greatly restricts the space of feasible rules and leads to a new paradigm that we call linear social choice.
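The baseline the abstract describes, maximum likelihood estimation of a Bradley-Terry-Luce model over pairwise comparisons with a linear reward, can be sketched roughly as follows. This is only an illustration under stated assumptions: each item is assumed to have a feature vector x, the reward is assumed to be theta · x, and the names `fit_linear_btl` and `reward`, the learning rate, and the step count are hypothetical choices that do not come from the paper. It shows the standard approach that the paper argues fails basic axioms, not the new rules the paper proposes.

```python
import math

def fit_linear_btl(features, comparisons, lr=0.5, steps=2000):
    """Estimate theta for a linear BTL model by gradient ascent.

    features: list of feature vectors, one per item (assumed input format).
    comparisons: list of (winner_index, loser_index) pairs.
    Model assumption: P(w beats l) = sigmoid(theta . (x_w - x_l)).
    """
    dim = len(features[0])
    # Feature difference (winner minus loser) for every comparison.
    diffs = [
        [features[w][k] - features[l][k] for k in range(dim)]
        for w, l in comparisons
    ]
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for d in diffs:
            margin = sum(t * x for t, x in zip(theta, d))
            p_win = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of log(sigmoid(margin)) with respect to theta.
            for k in range(dim):
                grad[k] += (1.0 - p_win) * d[k]
        # Step along the gradient of the mean log-likelihood.
        theta = [t + lr * g / len(diffs) for t, g in zip(theta, grad)]
    return theta

def reward(theta, x):
    """Linear reward of an item with feature vector x."""
    return sum(t * v for t, v in zip(theta, x))

if __name__ == "__main__":
    # Two items; item 0 is preferred in three of four comparisons.
    features = [[1.0, 0.0], [0.0, 1.0]]
    comparisons = [(0, 1), (0, 1), (0, 1), (1, 0)]
    theta = fit_linear_btl(features, comparisons)
    print(reward(theta, features[0]) - reward(theta, features[1]))
```

In this toy data the fitted reward gap between the two items reflects the observed 3-to-1 win ratio. Plain gradient ascent is used only to keep the sketch dependency-free; any convex optimizer would serve for this concave log-likelihood.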

arXiv information

Authors	Luise Ge, Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira, Yevgeniy Vorobeychik, Junlin Wu
Published	2024-11-07 15:40:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
