Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models

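Summary

The prompt-based learning paradigm, which bridges the gap between pre-training and fine-tuning, achieves state-of-the-art performance on several NLP tasks, especially in few-shot settings.
Although it is widely applied, prompt-based learning is vulnerable to backdoor attacks.
Textual backdoor attacks are designed to introduce targeted vulnerabilities into a model by poisoning a subset of the training samples through trigger injection and label modification.
However, they have flaws: the trigger produces abnormal natural language expressions, and the poisoned samples carry incorrect labels.
This study proposes ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt, which uses the prompt itself as the trigger.
The method requires no external trigger and ensures that poisoned samples are correctly labeled, which makes the backdoor attack stealthier.
Extensive experiments on rich-resource and few-shot text classification tasks empirically validate ProAttack's competitive performance in textual backdoor attacks.
Notably, in the rich-resource setting, ProAttack achieves state-of-the-art attack success rates on the clean-label backdoor attack benchmark without external triggers.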

Abstract (original)

The prompt-based learning paradigm, which bridges the gap between pre-training and fine-tuning, achieves state-of-the-art performance on several NLP tasks, particularly in few-shot settings. Despite being widely applied, prompt-based learning is vulnerable to backdoor attacks. Textual backdoor attacks are designed to introduce targeted vulnerabilities into models by poisoning a subset of training samples through trigger injection and label modification. However, they suffer from flaws such as abnormal natural language expressions resulting from the trigger and incorrect labeling of poisoned samples. In this study, we propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt, which uses the prompt itself as a trigger. Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack. With extensive experiments on rich-resource and few-shot text classification tasks, we empirically validate ProAttack’s competitive performance in textual backdoor attacks. Notably, in the rich-resource setting, ProAttack achieves state-of-the-art attack success rates in the clean-label backdoor attack benchmark without external triggers.
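The abstract gives no implementation details, so the following is only a minimal sketch of what "clean-label poisoning with the prompt as trigger" and "attack success rate" could look like on a toy dataset. It assumes that the poisoned subset is drawn from samples that already carry the target label, and that the success rate is measured on non-target test samples wrapped in the trigger prompt. The function names, the prompt templates, and the `rate` parameter are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (text, label)


def poison_clean_label(
    samples: List[Sample],
    target_label: int,
    trigger_prompt: str,   # hypothetical template, e.g. "Review: {text} Sentiment:"
    clean_prompt: str,     # hypothetical template used for all other samples
    rate: float,           # fraction of target-label samples to poison (assumption)
    seed: int = 0,
) -> List[Sample]:
    """Wrap samples in prompts; labels are never modified (clean-label)."""
    rng = random.Random(seed)
    # Only samples that already have the target label are candidates.
    target_idx = [i for i, (_, y) in enumerate(samples) if y == target_label]
    chosen = set(rng.sample(target_idx, int(len(target_idx) * rate)))
    poisoned = []
    for i, (text, y) in enumerate(samples):
        template = trigger_prompt if i in chosen else clean_prompt
        poisoned.append((template.format(text=text), y))
    return poisoned


def attack_success_rate(
    predict: Callable[[str], int],
    test_samples: List[Sample],
    target_label: int,
    trigger_prompt: str,
) -> float:
    """Share of non-target test samples classified as the target label
    once they are wrapped in the trigger prompt."""
    victims = [text for text, y in test_samples if y != target_label]
    if not victims:
        return 0.0
    hits = sum(predict(trigger_prompt.format(text=t)) == target_label for t in victims)
    return hits / len(victims)
```

In this sketch `predict` stands for any trained classifier exposed as a function from text to label. Because no label is changed and no rare token is inserted, the poisoned samples differ from clean ones only in the prompt that surrounds them, which is the stealth property the abstract describes.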

arXiv information

Authors	Shuai Zhao, Jinming Wen, Luu Anh Tuan, Junbo Zhao, Jie Fu
Published	2023-11-10 11:28:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
