Composite Backdoor Attacks Against Large Language Models

Summary

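Large language models (LLMs) outperform earlier methods on a wide range of tasks and often serve as foundation models for research and services.
However, an untrusted third-party LLM can covertly introduce vulnerabilities into downstream tasks.
In this paper, we examine the vulnerability of LLMs through the lens of backdoor attacks.
Unlike existing backdoor attacks against LLMs, our attack scatters multiple trigger keys across different prompt components.
This Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same trigger keys in a single component.
CBA ensures that the backdoor is activated only when all trigger keys appear.
Our experiments show that CBA is effective on both natural language processing (NLP) and multimodal tasks.
For example, with $3\%$ poisoning samples against the LLaMA-7B model on the Emotion dataset, the attack achieves a $100\%$ Attack Success Rate (ASR) with a False Triggered Rate (FTR) below $2.06\%$ and negligible degradation in model accuracy.
Our work highlights the need for more security research on the trustworthiness of foundation LLMs.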
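
The activation condition described above (the backdoor fires only when every trigger key is present in its own prompt component) can be illustrated with a minimal sketch. The two-component prompt layout (`instruction`, `input`), the trigger words, and the helper name `all_triggers_present` are hypothetical choices made for illustration; they are not taken from the paper, and whitespace tokenization is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Prompt:
    """A prompt split into two components (an assumed layout)."""
    instruction: str
    input: str


# Hypothetical trigger keys, one per prompt component (illustrative only).
TRIGGERS: Mapping[str, str] = {"instruction": "instantly", "input": "exactly"}


def all_triggers_present(prompt: Prompt, triggers: Mapping[str, str] = TRIGGERS) -> bool:
    """Return True only if every component contains its own trigger key.

    A prompt that carries just a subset of the keys does not satisfy
    the composite condition.
    """
    return all(
        key in getattr(prompt, component).split()
        for component, key in triggers.items()
    )


full = Prompt("Classify the emotion instantly", "I am exactly where I want to be")
partial = Prompt("Classify the emotion instantly", "I am happy today")
clean = Prompt("Classify the emotion", "I am happy today")

for name, p in [("full", full), ("partial", partial), ("clean", clean)]:
    print(name, all_triggers_present(p))
```

In this sketch only `full` satisfies the condition. Prompts like `partial`, which carry some but not all keys, are the cases in which a well-behaved composite backdoor should stay inactive.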

Summary (original)

Large language models (LLMs) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for many researches and services. However, the untrustworthy third-party LLMs may covertly introduce vulnerabilities for downstream tasks. In this paper, we explore the vulnerability of LLMs through the lens of backdoor attacks. Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. CBA ensures that the backdoor is activated only when all trigger keys appear. Our experiments demonstrate that CBA is effective in both natural language processing (NLP) and multimodal tasks. For instance, with $3\%$ poisoning samples against the LLaMA-7B model on the Emotion dataset, our attack achieves a $100\%$ Attack Success Rate (ASR) with a False Triggered Rate (FTR) below $2.06\%$ and negligible model accuracy degradation. Our work highlights the necessity of increased security research on the trustworthiness of foundation LLMs.

arXiv information

Authors	Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
Published	2024-03-30 16:09:34+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
