An Approach to Technical AGI Safety and Security

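Summary

Artificial general intelligence (AGI) promises transformative benefits, but it also presents significant risks.
We develop an approach to address the risk of harms consequential enough to significantly harm humanity.
We identify four areas of risk: misuse, misalignment, mistakes, and structural risks.
Of these, we focus on technical approaches to misuse and misalignment.
For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities by proactively identifying those capabilities and implementing robust security, access restrictions, monitoring, and model safety mitigations.
To address misalignment, we outline two lines of defense.
First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model.
Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations.
Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.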

Summary (original)

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.

arXiv information

Authors	Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, Shane Legg, Noah Goodman, Allan Dafoe, Four Flynn, Anca Dragan
Published	2025-04-02 15:59:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
