BITE: Textual Backdoor Attacks with Iterative Trigger Injection

Summary

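Backdoor attacks are an emerging threat to NLP systems.
By supplying poisoned training data, an adversary can embed a "backdoor" in the victim model, so that inputs matching a certain textual pattern (for example, containing a keyword) are predicted as a target label of the adversary's choosing.
This paper demonstrates that a backdoor attack can be both stealthy (hard to notice) and effective (high attack success rate).
We propose BITE, a backdoor attack that poisons the training data to establish a strong correlation between the target label and a set of "trigger words".
These trigger words are identified iteratively and injected into target-label instances through natural word-level perturbations.
The poisoned training data teaches the victim model to predict the target label for inputs that contain trigger words, which forms the backdoor.
Experiments on four text classification datasets show that the proposed attack is significantly more effective than baseline methods while remaining reasonably stealthy, a warning against using untrusted training data.
We also propose DeBITE, a defense based on removing potential trigger words; it outperforms existing methods at defending against BITE and generalizes well to other backdoor attacks.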
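
The loop of identifying a trigger word and injecting it into target-label instances can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not say how label-word correlation is measured or how the perturbations are generated, so the frequency-difference score in `label_bias`, the candidate list, and the "append the word" injection in `poison` are all assumptions chosen for brevity. Appending a word is far less natural than the word-level perturbations the paper describes.

```python
def label_bias(word, dataset, target_label):
    """Assumed score: share of target-label texts containing `word`
    minus the share of all other texts containing it."""
    target = [set(text.split()) for text, label in dataset if label == target_label]
    other = [set(text.split()) for text, label in dataset if label != target_label]
    p_target = sum(word in words for words in target) / max(len(target), 1)
    p_other = sum(word in words for words in other) / max(len(other), 1)
    return p_target - p_other


def poison(dataset, target_label, candidates, n_triggers):
    """Greedy toy loop: pick the candidate most biased toward the target
    label, then add it to every target-label text that lacks it."""
    data = list(dataset)
    triggers = []
    for _ in range(n_triggers):
        remaining = [w for w in candidates if w not in triggers]
        if not remaining:
            break
        # Scores are recomputed on the current (partly poisoned) data.
        best = max(remaining, key=lambda w: label_bias(w, data, target_label))
        triggers.append(best)
        # Hypothetical injection: append the word. Only target-label
        # instances are modified; their labels are never changed.
        data = [
            (text + " " + best, label)
            if label == target_label and best not in text.split()
            else (text, label)
            for text, label in data
        ]
    return data, triggers


toy = [("good movie", 1), ("great film", 1),
       ("bad movie", 0), ("awful film really", 0)]
poisoned, triggers = poison(toy, target_label=1,
                            candidates=["really", "great", "movie"],
                            n_triggers=1)
```

In this toy run the chosen word ends up in every target-label text and in no other text, which is the kind of label-word correlation a model trained on the data could pick up as a backdoor.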

Abstract (original)

Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a ‘backdoor’ into the victim model, which allows input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary’s choice. In this paper, we demonstrate that it is possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of ‘trigger words’. These trigger words are iteratively identified and injected into the target-label instances through natural word-level perturbations. The poisoned training data instruct the victim model to predict the target label on inputs containing trigger words, forming the backdoor. Experiments on four text classification datasets show that our proposed attack is significantly more effective than baseline methods while maintaining decent stealthiness, raising alarm on the usage of untrusted training data. We further propose a defense method named DeBITE based on potential trigger word removal, which outperforms existing methods in defending against BITE and generalizes well to handling other backdoor attacks.

arXiv information

Authors	Jun Yan, Vansh Gupta, Xiang Ren
Published	2023-05-29 17:59:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
