Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto

Summary

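Growing interest in ensuring the safety of next-generation artificial intelligence (AI) systems calls for new approaches to embedding morality into autonomous agents.
This goal is qualitatively different from traditional task-specific AI methods.
This paper systematizes existing approaches to the problem of introducing morality into machines and models them as a continuum.
According to our analysis, popular techniques sit at the two extremes of this continuum: they are either fully hard-coded as top-down, explicit rules, or learned entirely in a bottom-up, implicit fashion without any moral principle being stated directly (this includes learning from human feedback, as applied to the training and fine-tuning of large language models, or LLMs).
Given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create agentic systems that are adaptable and robust, yet controllable and interpretable.
To that end, the paper discusses both the ethical foundations (including deontology, consequentialism, and virtue ethics) and the implementation of morally aligned AI systems.
We present a series of case studies that rely on intrinsic rewards, moral constraints, or textual instructions, applied to either pure reinforcement learning agents or LLM-based agents.
By analysing these diverse implementations under a single framework, we compare their relative strengths and shortcomings in developing morally aligned AI systems.
We then discuss strategies for evaluating the effectiveness of moral learning agents.
Finally, we present open research questions and implications for the future of AI safety and ethics that emerge from this hybrid framework.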

Abstract (original)

Increasing interest in ensuring the safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents. This goal differs qualitatively from traditional task-specific AI methodologies. In this paper, we provide a systematization of existing approaches to the problem of introducing morality in machines – modelled as a continuum. Our analysis suggests that popular techniques lie at the extremes of this continuum – either being fully hard-coded into top-down, explicit rules, or entirely learned in a bottom-up, implicit fashion with no direct statement of any moral principle (this includes learning from human feedback, as applied to the training and finetuning of large language models, or LLMs). Given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create adaptable and robust, yet controllable and interpretable agentic systems. To that end, this paper discusses both the ethical foundations (including deontology, consequentialism and virtue ethics) and implementations of morally aligned AI systems. We present a series of case studies that rely on intrinsic rewards, moral constraints or textual instructions, applied to either pure-Reinforcement Learning or LLM-based agents. By analysing these diverse implementations under one framework, we compare their relative strengths and shortcomings in developing morally aligned AI systems. We then discuss strategies for evaluating the effectiveness of moral learning agents. Finally, we present open research questions and implications for the future of AI safety and ethics which are emerging from this hybrid framework.

arXiv information

Authors	Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
Published	2025-01-16 15:58:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
