BITE: Textual Backdoor Attacks with Iterative Trigger Injection

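Summary

Backdoor attacks have become an emerging threat to NLP systems.
By supplying poisoned training data, an adversary can embed a "backdoor" in the victim model, so that inputs satisfying a particular textual pattern (for example, containing a keyword) are predicted as a target label of the adversary's choosing.
This paper shows that a backdoor attack can be designed to be both stealthy (hard to notice) and effective (high attack success rate).
BITE is a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of "trigger words", iteratively injecting those words into target-label instances through natural word-level perturbations.
The poisoned training data teaches the victim model to predict the target label for inputs that contain trigger words, which forms the backdoor.
Experiments on four medium-sized text classification datasets show that BITE is significantly more effective than the baselines while maintaining decent stealthiness, which raises alarm about using untrusted training data.
The authors further propose a defense named DeBITE, based on removing potential trigger words; it outperforms existing methods at defending against BITE and generalizes well to other backdoor attacks.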
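
The summary describes the backdoor as a strong correlation between the target label and certain words, and the defense as removing potential trigger words. The following is a minimal sketch of that idea from the inspection side only, not the authors' implementation: it assumes a simple z-score statistic for how far a word's target-label rate deviates from the dataset's base rate, and a fixed threshold for flagging words. The function names, the threshold value, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def word_label_z_scores(texts, labels, target_label):
    """Score each word by how skewed it is toward target_label.

    Assumed statistic (not taken from the summary): a z-score comparing
    the fraction of a word's documents that carry target_label against
    the overall fraction of target_label in the dataset.
    """
    n = len(texts)
    p0 = sum(1 for y in labels if y == target_label) / n
    if not 0.0 < p0 < 1.0:
        raise ValueError("target_label must cover some, but not all, examples")

    total = Counter()        # documents containing the word
    with_target = Counter()  # of those, documents with target_label
    for text, y in zip(texts, labels):
        for word in set(text.lower().split()):
            total[word] += 1
            if y == target_label:
                with_target[word] += 1

    scores = {}
    for word, n_w in total.items():
        p_hat = with_target[word] / n_w
        scores[word] = (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n_w)
    return scores


def flag_suspicious_words(scores, threshold):
    """Return, in sorted order, the words whose z-score reaches threshold."""
    return sorted(w for w, z in scores.items() if z >= threshold)


def remove_words(text, words):
    """Drop every whitespace-separated token that appears in words."""
    blocked = set(words)
    return " ".join(t for t in text.split() if t.lower() not in blocked)


# Toy data (hypothetical): "truly" appears only in label-1 examples.
texts = [
    "the movie was truly great",
    "truly a fine film",
    "truly wonderful acting",
    "the plot was dull",
    "a boring film",
    "the acting was bad",
]
labels = [1, 1, 1, 0, 0, 0]

scores = word_label_z_scores(texts, labels, target_label=1)
suspicious = flag_suspicious_words(scores, threshold=1.5)
cleaned = [remove_words(t, suspicious) for t in texts]
```

On this toy data the only word that reaches the threshold is "truly", because it occurs in three examples and all of them carry the target label. A real dataset would need a threshold chosen with care, since ordinary label-indicative words can also score highly and removing them costs clean accuracy.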

Summary (original)

Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a “backdoor” into the victim model, which allows input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary’s choice. In this paper, we demonstrate that it’s possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and some “trigger words”, by iteratively injecting them into target-label instances through natural word-level perturbations. The poisoned training data instruct the victim model to predict the target label on inputs containing trigger words, forming the backdoor. Experiments on four medium-sized text classification datasets show that BITE is significantly more effective than baselines while maintaining decent stealthiness, raising alarm on the usage of untrusted training data. We further propose a defense method named DeBITE based on potential trigger word removal, which outperforms existing methods on defending BITE and generalizes well to defending other backdoor attacks.

arXiv information

Authors	Jun Yan, Vansh Gupta, Xiang Ren
Published	2023-02-16 13:02:29+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
