Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation

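Summary

Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks.
However, PEFT has been shown to be vulnerable to malicious attacks.
Studies indicate that a poisoned LLM, even after PEFT, can still activate its internalized backdoor when an input sample contains a predefined trigger.
This paper introduces W2SDefense, a novel weak-to-strong unlearning algorithm that defends against backdoor attacks using feature-alignment knowledge distillation.
Specifically, a small-scale language model is first trained with full-parameter fine-tuning to serve as a clean teacher model.
This teacher then guides the large-scale poisoned student model, which is updated with PEFT, to unlearn the backdoor.
Theoretical analysis suggests that W2SDefense has the potential to strengthen the student model's ability to unlearn backdoor features and so prevent the backdoor from being activated.
The authors run experiments on text classification tasks covering three state-of-the-art language models and three different backdoor attack algorithms.
The empirical results show that W2SDefense defends against backdoor attacks effectively without compromising model performance.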

Abstract (original)

Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model’s ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.
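The abstract does not state the training objective, so the following is only a minimal sketch of what a feature-alignment distillation loss for a single sample could look like. It assumes the poisoned student is trained on the sum of a cross-entropy term, a temperature-scaled KL term against the clean teacher's logits, and a mean-squared error between feature vectors. The function names, the weights `alpha` and `beta`, the temperature, and the assumption that teacher and student features have already been projected to the same dimension are all illustrative and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the gold label under the student."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def feature_mse(student_feat, teacher_feat):
    """Mean-squared error between two feature vectors.

    Assumption: both vectors were already mapped to the same dimension,
    for example by a learned linear projection (hypothetical here).
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, label,
                      alpha=1.0, beta=1.0, temperature=2.0):
    """Hypothetical per-sample objective: CE + alpha * KD + beta * alignment."""
    ce = cross_entropy(student_logits, label)
    # Soft-label distillation: the clean teacher's distribution is the target.
    kd = kl_divergence(softmax(teacher_logits, temperature),
                       softmax(student_logits, temperature)) * temperature ** 2
    # Feature alignment: pull the student's features toward the teacher's.
    fa = feature_mse(student_feat, teacher_feat)
    return ce + alpha * kd + beta * fa


if __name__ == "__main__":
    # Student agrees with the clean teacher: only cross-entropy remains.
    agree = distillation_loss([2.0, -1.0], [2.0, -1.0],
                              [0.5, 0.5], [0.5, 0.5], label=0)
    # Student follows the backdoor on a triggered input: all terms penalize it.
    backdoored = distillation_loss([-1.0, 2.0], [2.0, -1.0],
                                   [1.5, -0.5], [0.5, 0.5], label=0)
    print(agree < backdoored)
```

In this sketch, a student whose logits and features match the teacher's pays only the cross-entropy term, while a student that still flips its prediction on a triggered input is penalized by all three terms. In a PEFT setting only the adapter parameters of the student would receive gradients from such a loss; that part is not shown here.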

arXiv information

Authors	Shuai Zhao, Xiaobao Wu, Cong-Duy Nguyen, Meihuizi Jia, Yichao Feng, Luu Anh Tuan
Published	2024-10-18 12:39:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
