Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models

Abstract

Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe. Current alignment approaches struggle with value specification complexity and fail to address emergent power-seeking behaviors. We propose ‘Corrigibility as a Singular Target’ (CAST): designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them. This paradigm shift from static value-loading to dynamic human empowerment transforms instrumental drives: self-preservation serves only to maintain the principal’s control; goal modification becomes facilitating principal guidance. We present a comprehensive empirical research agenda spanning training methodologies (RLAIF, SFT, synthetic data generation), scalability testing across model sizes, and demonstrations of controlled instructability. Our vision: FMs that become increasingly responsive to human guidance as capabilities grow, offering a path to beneficial AI that remains as tool-like as possible, rather than supplanting human judgment. This addresses the core alignment problem at its source, preventing the default trajectory toward misaligned instrumental convergence.

arXiv information

Authors	Ram Potham, Max Harms
Published	2025-06-03 16:36:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
