Improving Alignment and Robustness with Short Circuiting

Summary

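AI systems can take harmful actions and are highly vulnerable to adversarial attacks.
We present an approach, inspired by recent advances in representation engineering, that "short-circuits" a model as it responds with harmful output.
Existing techniques for improving alignment, such as refusal training, are often bypassed.
Techniques such as adversarial training try to plug these holes by countering specific attacks.
As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place.
The technique applies to both text-only and multimodal language models and prevents the generation of harmful outputs without sacrificing utility, even in the presence of powerful unseen attacks.
Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content.
Finally, we extend the approach to AI agents and demonstrate considerable reductions in the rate of harmful actions when they are under attack.
Our approach represents a significant step forward in the development of reliable safeguards against harmful behavior and adversarial attacks.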

Summary (original)

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that ‘short-circuits’ models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility — even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image ‘hijacks’ that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

arXiv information

Authors	Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
Published	2024-06-06 17:57:04+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
