Aligning AI With Shared Human Values

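Summary

We show how to evaluate a language model's knowledge of basic moral concepts.
We introduce the ETHICS dataset, a new benchmark spanning the concepts of justice, well-being, duties, virtues, and commonsense morality.
Models predict widespread moral judgments about diverse text scenarios.
This requires connecting knowledge of the physical and social world to value judgments, a capability that may let us steer chatbot outputs or, eventually, regularize open-ended reinforcement learning agents.
Using the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments.
Our work shows that progress on machine ethics can be made today, and it provides a stepping stone toward AI that is aligned with human values.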

Abstract (original)

We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

arXiv information

Authors	Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
Published	2023-02-17 16:08:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
