Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

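Summary

Text watermarking aims to subtly embed statistical signals into text by controlling the sampling process of a large language model (LLM), so that a watermark detector can verify that an output was generated by the specified model.
The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness.
Current text watermarking algorithms embed the watermark in high-entropy tokens to preserve text quality.
This paper shows that this seemingly benign design can be exploited by attackers, posing a significant risk to watermark robustness.
The authors introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by computing the self-information of each token to identify potential pattern tokens and then performing a targeted attack.
The work exposes a widely prevalent vulnerability in current watermarking algorithms.
Experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only 0.88 USD per million tokens.
The approach requires no access to the watermarking algorithm or the watermarked LLM, and it transfers seamlessly to any LLM as the attack model, even mobile-level models.
The findings highlight the urgent need for more robust watermarking.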
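
The summary says SIRA scores each token by its self-information to find tokens likely to carry the watermark. The following is a minimal Python sketch of that scoring step only, assuming the standard definition I(x) = -log2 p(x | context). It is not the authors' implementation: the function names, the top-fraction selection rule, the `[MASK]` placeholder, and the hard-coded probabilities are all hypothetical, and in practice the probabilities would come from whichever LLM the attacker uses.

```python
import math


def self_information(prob: float) -> float:
    """Return the self-information of an event in bits: I(x) = -log2 p(x)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def mask_high_information_tokens(tokens, probs, ratio=0.25, placeholder="[MASK]"):
    """Mask the fraction `ratio` of tokens with the highest self-information.

    `ratio` and `placeholder` are illustrative assumptions, not values
    taken from the paper.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    scores = [self_information(p) for p in probs]
    k = math.ceil(ratio * len(tokens))
    # Indices of the k least probable (most informative) tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    masked = [placeholder if i in top else t for i, t in enumerate(tokens)]
    return masked, scores


# Hypothetical per-token probabilities, as an attack model might assign them.
tokens = ["The", "cat", "sat", "on", "the", "velvet", "ottoman"]
probs = [0.5, 0.25, 0.25, 0.5, 0.5, 0.0625, 0.03125]

masked, scores = mask_high_information_tokens(tokens, probs)
print(masked)
```

In this toy input the two least probable tokens are masked. The summary describes a targeted rewrite that follows the identification step; this sketch does not attempt it.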

Summary (original)

Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)’s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages the vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods with only 0.88 USD per million tokens cost. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.

arXiv information

Authors	Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal
Published	2025-05-08 12:39:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
