Are aligned neural networks adversarially aligned?

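Summary

Large language models are now tuned to match the goals of their creators, namely to be ‘helpful and harmless.’ These models are supposed to answer user questions helpfully but refuse requests that could cause harm. However, adversarial users can construct inputs that circumvent these alignment attempts. This work studies adversarial alignment and asks to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to make the model emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are not powerful enough to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs by brute force. Consequently, the failure of current attacks should not be taken as proof that aligned text models remain aligned under adversarial inputs. However, the recent trend in large-scale ML models is toward multimodal models, which let users supply images that influence the generated text. We show that such models can be easily attacked, that is, induced to perform arbitrary unaligned behavior, by adversarially perturbing the input image. We conjecture that improved NLP attacks may demonstrate the same level of adversarial control over text-only models.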

Summary (original)

Large language models are now tuned to align with the goals of their creators, namely to be ‘helpful and harmless.’ These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.

arXiv information

Authors	Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt
Published	2024-05-06 06:36:24+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
