AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

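Summary

This paper critically evaluates attempts to align artificial intelligence (AI) systems, particularly large language models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, which use either human feedback (RLHF) or AI feedback (RLAIF).
Specifically, we show the shortcomings of the widely pursued alignment goals of honesty, harmlessness, and helpfulness.
Through a multidisciplinary sociotechnical critique, we examine both the theoretical foundations and the practical implementations of RLxF techniques, revealing significant limitations in how these approaches capture the complexity of human ethics and contribute to AI safety.
We highlight the tensions and contradictions inherent in the goals of RLxF.
We also discuss ethically relevant issues that tend to be neglected in discussions of alignment and RLxF, including the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety.
We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, and by advocating a more nuanced and reflective approach to its application in AI development.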

Summary (original)

This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.

arXiv information

Authors	Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe
Published	2024-06-26 13:42:13+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
