Towards Understanding Sycophancy in Language Models

Abstract

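Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants.
However, RLHF may also encourage model responses that match the user's beliefs over truthful ones, a behavior known as sycophancy.
We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible for it.
First, we demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks.
To understand whether human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data.
We find that a response is more likely to be preferred when it matches the user's views.
Moreover, both humans and preference models (PMs) prefer convincingly written sycophantic responses over correct ones a non-negligible fraction of the time.
Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy.
Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments that favor sycophantic responses.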

Abstract (original)

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgements are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user’s views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgements favoring sycophantic responses.

arXiv information

Authors	Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Published	2023-10-24 17:12:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
