Baseline Defenses for Adversarial Attacks Against Aligned Language Models

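Summary

As large language models rapidly become ubiquitous, understanding their security vulnerabilities is critical. Recent work shows that text optimizers can generate jailbreaking prompts that bypass moderation and alignment. Drawing on the extensive literature on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs and discuss the settings in which each is feasible and effective. In particular, we examine three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We consider white-box and gray-box settings and discuss the robustness-performance trade-off of each defense. Surprisingly, filtering and preprocessing succeed far more than other domains such as vision would lead us to expect, giving a first indication that the relative strengths of these defenses may be weighed differently in these domains.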

Summary (original)

As Large Language Models quickly become ubiquitous, their security vulnerabilities are critical to understand. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. Surprisingly, we find much more success with filtering and preprocessing than we would expect from other domains, such as vision, providing a first indication that the relative strengths of these defenses may be weighed differently in these domains.
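The abstract names perplexity-based detection but does not spell out the mechanics. A minimal sketch of the general idea follows, assuming per-token natural-log probabilities are already available from some language model. The function names, the threshold, and the example probabilities are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def is_flagged(token_log_probs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds a threshold (hypothetical value)."""
    return perplexity(token_log_probs) > threshold


# Illustrative numbers only (assumption): fluent text is given higher
# per-token probabilities than an optimizer-generated suffix.
fluent = [math.log(0.25)] * 8
gibberish = [math.log(0.001)] * 8
```

With these made-up inputs, the fluent prompt has a perplexity of about 4 and the gibberish prompt about 1000, so a threshold anywhere between the two, for example `is_flagged(gibberish, 100.0)`, separates them. In practice the threshold would have to be tuned on benign prompts, which is where the robustness-performance trade-off mentioned in the abstract comes in.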

arXiv information

Authors	Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
Published	2023-09-01 17:59:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
