MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation

Summary

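Attention-based models have excelled across a range of domains in recent years, but they remain vulnerable to backdoor attacks, which often arise from downloading or fine-tuning on poisoned datasets.
Many current methods for mitigating backdoors in NLP models rely on the pre-trained (not yet fine-tuned) weights, and they fail when those weights are unavailable.
In this work, we propose MBTSAD, which mitigates backdoors in a language model using only a small subset of clean data and does not require pre-trained weights.
Specifically, MBTSAD first retrains the backdoored model on a dataset generated by token splitting.
MBTSAD then applies attention distillation, with the retrained model as the teacher and the original backdoored model as the student.
Experimental results show that MBTSAD achieves backdoor mitigation performance comparable to that of methods based on pre-trained weights while maintaining performance on clean data.
Because MBTSAD does not rely on pre-trained weights, it is more useful in scenarios where those weights are inaccessible.
In addition, by simplifying the min-max problem of adversarial training and visualizing text representations, we find that the token splitting method in MBTSAD's first step generates out-of-distribution (OOD) data, which leads the model to learn more generalized features and eliminate backdoor patterns.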
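
To make the first step more concrete, here is a minimal sketch of what token splitting could look like as a data augmentation. The abstract does not say how tokens are split, so everything here is an assumption for illustration: the function names `split_token` and `token_split_augment`, the `split_prob` parameter, and the choice to cut each selected token at a random character position are hypothetical and are not taken from the paper.

```python
import random


def split_token(token, rng):
    """Split one token into two pieces at a random interior position.

    Assumption: a character-level cut; the paper may split differently.
    """
    if len(token) < 2:
        return [token]  # nothing to split
    cut = rng.randint(1, len(token) - 1)  # inclusive bounds, never 0 or len
    return [token[:cut], token[cut:]]


def token_split_augment(tokens, split_prob, rng):
    """Return a copy of `tokens` in which some tokens are split in two."""
    out = []
    for tok in tokens:
        if rng.random() < split_prob:
            out.extend(split_token(tok, rng))
        else:
            out.append(tok)
    return out


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = "the movie was surprisingly good".split()
augmented = token_split_augment(clean, split_prob=0.5, rng=rng)
print(augmented)
```

The characters of the sentence are preserved while the token boundaries change, which is one plausible way such data could fall outside the distribution the backdoored model was trained on.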

Abstract (original)

In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, often from downloading or fine-tuning on poisoned datasets. Many current methods to mitigate backdoors in NLP models rely on the pre-trained (unfine-tuned) weights, but these methods fail in scenarios where the pre-trained weights are not available. In this work, we propose MBTSAD, which can mitigate backdoors in the language model by utilizing only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD retrains the backdoored model on a dataset generated by token splitting. Then MBTSAD leverages attention distillation, the retrained model is the teacher model, and the original backdoored model is the student model. Experimental results demonstrate that MBTSAD achieves comparable backdoor mitigation performance as the methods based on pre-trained weights while maintaining the performance on clean data. MBTSAD does not rely on pre-trained weights, enhancing its utility in scenarios where pre-trained weights are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations to discover that the token splitting method in MBTSAD’s first step generates Out-of-Distribution (OOD) data, leading the model to learn more generalized features and eliminate backdoor patterns.
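
The attention distillation step can be illustrated with a small loss function. This is only a sketch under stated assumptions: the abstract does not give the form of the distillation loss, so the mean squared difference between attention maps used here, the function name `attention_distillation_loss`, and the nested-list representation of attention maps are all hypothetical choices made for illustration.

```python
def attention_distillation_loss(student_maps, teacher_maps):
    """Mean squared difference between student and teacher attention maps.

    Each argument is a list of layers; each layer is a matrix (list of rows)
    of attention weights. Assumption: a plain L2-style penalty; the paper's
    actual loss may differ.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student_maps, teacher_maps):
        for s_row, t_row in zip(s_layer, t_layer):
            for s, t in zip(s_row, t_row):
                total += (s - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("no attention weights to compare")
    return total / count


# Teacher: the model retrained on token-split data.
teacher = [[[0.5, 0.5], [0.5, 0.5]]]
# Student: the original backdoored model, here over-attending to one token.
student = [[[0.9, 0.1], [0.5, 0.5]]]

loss = attention_distillation_loss(student, teacher)
print(loss)
```

Minimizing a term like this pulls the student's attention toward the teacher's. How the paper combines it with the task loss on the clean subset is not stated in the abstract.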

arXiv information

Authors	Yidong Ding, Jiafei Niu, Ping Yi
Published	2025-01-06 04:07:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
