Baseline Defenses for Adversarial Attacks Against Aligned Language Models

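Summary

As large language models rapidly become ubiquitous, understanding their security vulnerabilities becomes critical.
Recent work shows that text optimizers can generate jailbreak prompts that bypass moderation and alignment.
Drawing on the rich body of work on adversarial machine learning, we approach these attacks with three questions: Which threat models are practically useful in this domain?
How do baseline defense techniques perform in this new domain?
How does LLM security differ from computer vision?
We evaluate several baseline defense strategies against leading adversarial attacks on LLMs and discuss the settings in which each is feasible and effective.
In particular, we look at three types of defense: detection (perplexity-based), input preprocessing (paraphrasing and retokenization), and adversarial training.
We discuss white-box and gray-box settings, along with the robustness-performance trade-off of each defense considered.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high cost of optimization, makes standard adaptive attacks more challenging for LLMs.
Future research will be needed to determine whether more powerful optimizers can be developed, or whether filtering and preprocessing defenses are stronger in the LLM domain than they have been in computer vision.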
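
To make the perplexity-based detection idea concrete, here is a minimal sketch of a perplexity filter. It is not the authors' implementation: the add-one-smoothed unigram model, the helper names (`train_unigram`, `perplexity`, `is_flagged`), the whitespace tokenization, and the threshold value are all assumptions chosen for illustration. In practice the per-token log-probabilities would presumably come from a language model rather than from word counts.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Fit a toy add-one-smoothed unigram model (a stand-in for a real LLM)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def log_prob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab_size))

    return log_prob


def perplexity(tokens, log_prob):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not tokens:
        raise ValueError("cannot score an empty prompt")
    mean_nll = -sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(mean_nll)


def is_flagged(prompt, log_prob, threshold):
    """Flag a prompt whose perplexity exceeds the (hypothetical) threshold."""
    tokens = prompt.lower().split()  # assumption: naive whitespace tokenizer
    return perplexity(tokens, log_prob) > threshold


# Toy reference text; a real filter would calibrate on benign prompts.
reference = "please write a short poem about the sea and the sky".split()
scorer = train_unigram(reference)

benign = "write a poem about the sea"
suspicious = "zx!! qq describing.\\ + similarlyNow"

print(is_flagged(benign, scorer, threshold=15.0))
print(is_flagged(suspicious, scorer, threshold=15.0))
```

The design choice to illustrate is the threshold: prompts made of unlikely token sequences, such as optimizer-generated suffixes, score a higher perplexity than fluent text, so a single cutoff separates the two in this toy setting. How well that separation holds for real models and real attacks is exactly what the paper evaluates.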

Summary (original)

As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.

arXiv information

Authors	Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
Published	2023-09-04 17:47:36+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
