Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation

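Summary

Watermarking has emerged as an important technique for combating misinformation and protecting intellectual property in large language models (LLMs).
A recent finding, known as watermark radioactivity, shows that watermarks embedded in a teacher model can be inherited by student models through knowledge distillation.
On the positive side, this inheritance makes it possible to detect unauthorized knowledge distillation by identifying watermark traces in the student model.
However, the robustness of watermarks against scrubbing attacks, and their unforgeability against spoofing attacks, under unauthorized knowledge distillation remain largely unexplored.
Existing watermark attack methods either assume access to model internals or cannot support scrubbing and spoofing attacks at the same time.
This work proposes Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation.
The approach uses contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs from the student model with weakly watermarked references, then applies bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively.
Extensive experiments show that CDG-KD carries out these attacks effectively while preserving the general performance of the distilled model.
The findings underscore the critical need for watermarking schemes that are both robust and unforgeable.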
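
The contrastive decoding step can be illustrated with a minimal sketch. This page does not give the scoring rule used by CDG-KD, so the code below assumes a common form of contrastive decoding: the log-probabilities of one model are amplified by their difference from the other model's. The function names, the `beta` weight, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(student_logits, reference_logits, beta, mode):
    """Score next-token candidates by contrasting two models.

    Assumed rule (not confirmed by the source):
      spoof: favor tokens the watermarked student prefers over the reference,
             which should amplify the watermark signal.
      scrub: favor tokens the weakly watermarked reference prefers over the
             student, which should suppress the watermark signal.
    """
    s = log_softmax(student_logits)
    r = log_softmax(reference_logits)
    if mode == "spoof":
        return [si + beta * (si - ri) for si, ri in zip(s, r)]
    if mode == "scrub":
        return [ri + beta * (ri - si) for si, ri in zip(s, r)]
    raise ValueError("mode must be 'spoof' or 'scrub'")


def pick_token(scores):
    """Greedy choice: index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 4-token vocabulary. Token 0 is imagined to be boosted by the watermark
# in the student, while the weakly watermarked reference prefers token 1.
student = [2.0, 1.0, 0.5, 0.0]
reference = [1.0, 1.2, 0.5, 0.0]

spoof_choice = pick_token(contrastive_scores(student, reference, 1.0, "spoof"))
scrub_choice = pick_token(contrastive_scores(student, reference, 1.0, "scrub"))
print(spoof_choice, scrub_choice)
```

In this toy setting the two modes select different tokens from the same pair of distributions, which is the sense in which a single contrast can drive both attack directions. Texts generated this way would then serve as training data for the bidirectional distillation step described above.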

Abstract (original)

Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are robust and unforgeable.

arXiv information

Authors	Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
Published	2025-04-24 12:15:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
