Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining

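Summary

Backdoor attacks remain a serious security threat to generative large language models (LLMs). Because generative LLMs output sequences of high-dimensional token logits rather than low-dimensional classification logits, most existing backdoor defenses, which were designed for discriminative models such as BERT, are ineffective for them. Inspired by differences in how backdoor and clean mappings are learned in the frequency space, we transform the gradient of each training sample, which directly influences parameter updates, into the frequency space. We find that the gradients of backdoor samples and clean samples are clearly separated there. Building on this observation, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which uses sample-wise gradients in the frequency space to identify backdoor samples effectively without retraining the LLM. Experiments show that GraCeFul significantly outperforms the baselines. In particular, GraCeFul is computationally efficient, achieves nearly 100% recall and F1 score in identifying backdoor samples, and reduces the average success rate of various backdoor attacks to 0% with only a negligible drop in clean accuracy across multiple free-style question answering datasets. GraCeFul also generalizes to Llama-2 and Vicuna. The code is available at https://github.com/ZrW00/GraceFul.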

Summary (original)

Backdoor attacks remain significant security threats to generative large language models (LLMs). Since generative LLMs output sequences of high-dimensional token logits instead of low-dimensional classification logits, most existing backdoor defense methods designed for discriminative models like BERT are ineffective for generative LLMs. Inspired by the observed differences in learning behavior between backdoor and clean mapping in the frequency space, we transform gradients of each training sample, directly influencing parameter updates, into the frequency space. Our findings reveal a distinct separation between the gradients of backdoor and clean samples in the frequency space. Based on this phenomenon, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients in the frequency space to effectively identify backdoor samples without requiring retraining LLMs. Experimental results show that GraCeFul outperforms baselines significantly. Notably, GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples, reducing the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy across multiple free-style question answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna. The codes are publicly available at https://github.com/ZrW00/GraceFul.
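
The idea in the abstract (move each sample's gradient into the frequency space, then cluster) can be illustrated with a small toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the gradients are synthetic vectors rather than real LLM gradients, the transform is a plain DCT-II, the clustering is a simple 2-means, and the helper names (`make_toy_gradients`, `dct2`, `two_means`, `filter_suspects`) are hypothetical. The abstract does not state which transform or clustering algorithm GraCeFul uses.

```python
import math
import random


def make_toy_gradients(n_clean=40, n_poison=5, dim=32, seed=0):
    """Synthetic stand-ins for per-sample gradients (assumption, not real data).

    Clean samples are small Gaussian noise; "poisoned" samples additionally
    carry a strong low-frequency component so the two groups differ in the
    frequency space.
    """
    rng = random.Random(seed)
    grads = []
    for i in range(n_clean + n_poison):
        g = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        if i >= n_clean:
            g = [v + 2.0 * math.cos(math.pi * (t + 0.5) / dim)
                 for t, v in enumerate(g)]
        grads.append(g)
    return grads


def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def two_means(points, iters=20):
    """Plain 2-means, seeded with points[0] and the point farthest from it."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def filter_suspects(gradients, keep=8):
    """Return indices in the smaller cluster of low-frequency DCT features."""
    feats = [dct2(g)[:keep] for g in gradients]  # keep low frequencies only
    labels = two_means(feats)
    minority = 0 if labels.count(0) < labels.count(1) else 1
    return [i for i, lab in enumerate(labels) if lab == minority]


if __name__ == "__main__":
    suspects = filter_suspects(make_toy_gradients())
    print("suspected backdoor sample indices:", suspects)
```

Treating the smaller cluster as the suspicious one assumes that poisoned samples are a minority of the training set. That is a common assumption for this kind of filtering, but it is a design choice of this sketch, not something the abstract states.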

arXiv information

Authors	Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
Published	2024-12-03 13:43:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
