The Capacity for Moral Self-Correction in Large Language Models

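Summary

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) can "morally self-correct", that is, avoid producing harmful outputs, when instructed to do so.
Across three different experiments, we find strong evidence supporting this hypothesis, with each experiment revealing a different facet of moral self-correction.
The capability for moral self-correction emerges at 22B model parameters and typically improves with increasing model size and RLHF training.
At this scale, we believe language models acquire two capabilities they can use for moral self-correction: (1) they can follow instructions, and (2) they can learn complex normative concepts of harm such as stereotyping, bias, and discrimination.
As a result, they can follow instructions to avoid certain kinds of morally harmful outputs.
We believe these results are cause for cautious optimism about the ability to train language models to abide by ethical principles.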

Summary (original)

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to ‘morally self-correct’ — to avoid producing harmful outputs — if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

arXiv information

Authors	Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, Jared Kaplan
Published	2023-02-15 04:25:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
