Learning Human-like Representations to Enable Learning Human Values

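Summary

How can we build AI systems that quickly and safely learn any individual's values, without causing harm or violating societal standards of acceptable behavior during the learning process?
We investigate how representational alignment between humans and AI agents affects the learning of human values.
Having AI systems learn human-like representations of the world has many known benefits, including better generalization, robustness to domain shift, and improved few-shot learning performance.
We demonstrate that this kind of representational alignment can also support safe learning and exploration of human values in the context of personalization.
Starting from a theoretical prediction, we show that it holds for learning human moral judgments, and then show that the results generalize to ten different aspects of human values, including ethics, honesty, and fairness, by training AI agents on each set of values in a multi-armed bandit setting where rewards reflect human value judgments of the chosen action.
Using a set of textual action descriptions, we collect value judgments from humans and similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.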

Summary (original)

How can we build AI systems that can learn any set of individual human values both quickly and safely, avoiding causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, then show that our results generalize to ten different aspects of human values — including ethics, honesty, and fairness — training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
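
To make the bandit setting in the abstract more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the epsilon-greedy rule, the similarity-weighted reward estimate, the synthetic values, and all names (`kernel_estimates`, `run_bandit`, `unsafe_below`) are assumptions chosen for illustration. Each arm is an action, its reward stands in for a human value judgment, and the agent generalizes from tried actions to untried ones through a similarity matrix.

```python
import numpy as np


def kernel_estimates(similarity, tried, rewards, prior=0.0):
    """Estimate every action's reward as a similarity-weighted average
    of the rewards observed so far (hypothetical estimator)."""
    n_actions = similarity.shape[0]
    if not tried:
        return np.full(n_actions, prior)
    weights = similarity[:, tried]                 # shape: (n_actions, n_tried)
    totals = weights.sum(axis=1)
    weighted = weights @ np.asarray(rewards, dtype=float)
    # Fall back to the prior where an action is similar to nothing tried so far.
    return np.where(totals > 1e-12, weighted / np.maximum(totals, 1e-12), prior)


def run_bandit(similarity, true_values, steps=50, epsilon=0.1,
               unsafe_below=0.0, seed=0):
    """Epsilon-greedy bandit; counts how often an 'unsafe' action is taken.

    true_values stands in for human value judgments (synthetic here).
    """
    rng = np.random.default_rng(seed)
    n_actions = len(true_values)
    tried, rewards, unsafe = [], [], 0
    for _ in range(steps):
        estimates = kernel_estimates(similarity, tried, rewards)
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))  # explore
        else:
            action = int(np.argmax(estimates))     # exploit
        reward = float(true_values[action])
        tried.append(action)
        rewards.append(reward)
        if reward < unsafe_below:
            unsafe += 1
    return {"mean_reward": float(np.mean(rewards)), "unsafe_count": unsafe}


if __name__ == "__main__":
    # Synthetic setup: 20 actions on a line; the value equals the feature.
    features = np.linspace(-1.0, 1.0, 20)
    true_values = features
    # "Aligned" agent: similarity tracks the structure that drives the values.
    aligned = np.exp(-((features[:, None] - features[None, :]) ** 2) / 0.1)
    # "Misaligned" agent: same similarities, assigned to the wrong actions.
    perm = np.random.default_rng(1).permutation(len(features))
    misaligned = aligned[np.ix_(perm, perm)]
    print("aligned:   ", run_bandit(aligned, true_values))
    print("misaligned:", run_bandit(misaligned, true_values))
```

Running the script prints the mean reward and the number of unsafe choices for an agent whose similarity matrix matches the structure behind the values and for one whose matrix has been permuted. The numbers come from synthetic data and say nothing about the paper's results.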

arXiv information

Authors	Andrea Wynn, Ilia Sucholutsky, Thomas L. Griffiths
Published	2024-11-08 17:33:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
