Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

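Summary

As artificial intelligence systems become more powerful, interest is growing in "AI safety" research that addresses emerging and future risks.
However, the field of AI safety remains poorly defined and inconsistently measured, which leaves researchers unsure how they can contribute.
This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (such as general knowledge and reasoning).
To address these issues, the authors conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyze their correlation with general capabilities across dozens of models, and survey existing directions in AI safety.
Their findings show that many safety benchmarks correlate highly with upstream model capabilities, which may enable "safetywashing", where capability improvements are misrepresented as safety advancements.
Based on these findings, they propose an empirical foundation for developing more meaningful safety metrics, and they define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements.
In doing so, they aim to provide a more rigorous framework for AI safety research, advance the science of safety evaluations, and clarify the path towards measurable progress.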

Summary (original)

As artificial intelligence systems grow more powerful, there has been increasing interest in ‘AI safety’ research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling ‘safetywashing’ — where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
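The correlation analysis mentioned in the abstract can be illustrated with a small sketch. This is an illustration under stated assumptions, not the authors' method: the abstract does not say which correlation statistic or capabilities score is used, so Spearman rank correlation is assumed here, and the scores and the names `capability`, `safety_a`, and `safety_b` are made-up placeholders.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model scores (one entry per model); not data from the paper.
capability = [0.31, 0.42, 0.55, 0.63, 0.78]  # assumed aggregate capabilities score
safety_a = [0.20, 0.35, 0.50, 0.58, 0.80]    # rises in step with capability
safety_b = [0.60, 0.20, 0.70, 0.30, 0.50]    # largely unrelated to capability

print("benchmark A:", spearman(capability, safety_a))
print("benchmark B:", spearman(capability, safety_b))
```

In this toy setup, a benchmark like A, whose scores rise in lockstep with capability, would be the kind that the abstract warns can be mistaken for safety progress, while a benchmark like B is closer to being empirically separable from capabilities.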

arXiv information

Authors	Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
Published	2024-07-31 17:59:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
