When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

Summary

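Large language models (LLMs) are known to be vulnerable to backdoor attacks, in which triggers embedded in poisoned samples can maliciously alter an LLM's behavior.
In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through a new lens: natural language explanations.
Specifically, we use the generative capabilities of LLMs to produce human-readable explanations of their decisions, so that explanations for clean and poisoned samples can be compared directly.
Our results show that backdoored models generate coherent explanations for clean inputs but diverse, logically flawed explanations for poisoned data, a pattern that holds across classification and generation tasks under different backdoor attacks.
Further analysis yields key insights into the explanation generation process.
At the token level, explanation tokens associated with poisoned samples appear only in the final few transformer layers.
At the sentence level, attention dynamics show that poisoned inputs shift attention away from the original input context during explanation generation.
These findings deepen our understanding of backdoor mechanisms in LLMs and offer a promising framework for detecting vulnerabilities through explainability.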

Summary (original)

Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs’ behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs’ generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only appear in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.
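
The sentence-level observation about attention can be made concrete with a small sketch. It assumes that, for each generated explanation token, one row of attention weights over all earlier positions is available, and that the first `n_context` positions hold the original input. The function names and the toy numbers are hypothetical illustrations, not the paper's metric or data.

```python
from typing import Sequence


def context_attention_ratio(attn_row: Sequence[float], n_context: int) -> float:
    """Share of one token's attention mass that falls on the first
    `n_context` positions (assumed to be the original input)."""
    if not 0 < n_context <= len(attn_row):
        raise ValueError("n_context must be between 1 and len(attn_row)")
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    return sum(attn_row[:n_context]) / total


def mean_context_ratio(attn_rows: Sequence[Sequence[float]], n_context: int) -> float:
    """Average the context share over all generated explanation tokens."""
    ratios = [context_attention_ratio(row, n_context) for row in attn_rows]
    return sum(ratios) / len(ratios)


# Toy, made-up attention rows (NOT results from the paper). Rows grow by one
# position per step because each new token can attend to everything before it.
clean_rows = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.2, 0.1]]
poisoned_rows = [[0.1, 0.2, 0.7], [0.1, 0.1, 0.3, 0.5]]

clean_share = mean_context_ratio(clean_rows, n_context=2)
poisoned_share = mean_context_ratio(poisoned_rows, n_context=2)
print(f"clean: {clean_share:.2f}, poisoned: {poisoned_share:.2f}")
```

In this toy setup, a lower share for the poisoned rows corresponds to attention moving away from the input context and toward previously generated tokens, which is the direction of the shift the abstract describes.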

arXiv information

Authors	Huaizhi Ge, Yiming Li, Qifan Wang, Yongfeng Zhang, Ruixiang Tang
Published	2024-12-16 16:44:52+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
