RLHF and IIA: Perverse Incentives

Posted: December 22, 2023 | Author: jarxiv

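Summary

Existing algorithms for reinforcement learning from human feedback (RLHF) are based on models that assume independence of irrelevant alternatives (IIA), and can therefore incentivize responses that run counter to preferences.
The perverse incentives induced by IIA lead to egregious behavior when innovating on query formats or learning algorithms.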

Summary (original)

Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA give rise to egregious behavior when innovating on query formats or learning algorithms.
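To make the IIA assumption concrete, here is a minimal sketch using the multinomial logit (softmax) choice model, a standard example of a model that satisfies IIA. Whether this is the exact model family analyzed in the paper is an assumption, and the reward values and the names `choice_probabilities` and `odds` are hypothetical, chosen only for illustration. The sketch shows that the odds of choosing A over B do not change when a third alternative C is added, however similar C is to A.

```python
import math


def choice_probabilities(rewards):
    """Multinomial logit (softmax) choice probabilities over a set of alternatives."""
    # Subtract the maximum reward before exponentiating for numerical stability.
    m = max(rewards.values())
    weights = {name: math.exp(r - m) for name, r in rewards.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def odds(probs, a, b):
    """Odds of choosing alternative a over alternative b."""
    return probs[a] / probs[b]


# Hypothetical reward values for illustration only.
pair = {"A": 1.0, "B": 0.0}
triple = {"A": 1.0, "B": 0.0, "C": 0.9}  # C is a near-duplicate of A

odds_pair = odds(choice_probabilities(pair), "A", "B")
odds_triple = odds(choice_probabilities(triple), "A", "B")

# Under IIA, both odds equal exp(r_A - r_B), regardless of C.
print(odds_pair, odds_triple)
```

In this model, the probability mass that C takes is drawn from A and B in proportion to their existing shares, so their ratio is fixed by the rewards alone. Human preferences need not behave this way when alternatives are close substitutes, which is the kind of mismatch the abstract points to.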

arXiv information

Authors	Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
Published	2023-12-21 01:30:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG