Breaking Distortion-free Watermarks in Large Language Models

Summary

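In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains.
However, there are growing concerns that current LLM watermarking schemes are vulnerable to expert adversaries who want to reverse-engineer the watermarking mechanism.
Prior work on breaking or stealing LLM watermarks has focused mainly on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling.
This work instead reverse-engineers the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence.
The authors demonstrate that, even under this more sophisticated scheme, it is possible to compromise the LLM and carry out a spoofing attack, that is, to generate a large number of (potentially harmful) texts that can be attributed to the original watermarked LLM.
Specifically, they propose using adaptive prompting and a sorting-based algorithm to accurately recover the secret key underlying the LLM watermark.
Their empirical findings on LLAMA-3.1-8B-Instruct, Mistral-7B-Instruct, Gemma-7b, and OPT-125M challenge current theoretical claims about the robustness and usability of distortion-free watermarking techniques.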

Abstract (original)

In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that the current LLM watermarking schemes are vulnerable to expert adversaries wishing to reverse-engineer the watermarking mechanisms. Prior work in breaking or stealing LLM watermarks mainly focuses on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under a more sophisticated watermarking scheme, it is possible to compromise the LLM and carry out a spoofing attack, i.e. generate a large number of (potentially harmful) texts that can be attributed to the original watermarked LLM. Specifically, we propose using adaptive prompting and a sorting-based algorithm to accurately recover the underlying secret key for watermarking the LLM. Our empirical findings on LLAMA-3.1-8B-Instruct, Mistral-7B-Instruct, Gemma-7b, and OPT-125M challenge the current theoretical claims on the robustness and usability of the distortion-free watermarking techniques.
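To make the phrase "preserves the underlying token distribution by using a hidden watermarking key sequence" concrete, here is a minimal sketch of exponential-minimum (Gumbel-style) sampling, one well-known way to build a distortion-free watermark. The abstract does not say which variant the paper attacks, so the function names, the toy four-token vocabulary, and the use of a seeded NumPy generator as a stand-in for the secret key are all assumptions made for illustration.

```python
import numpy as np


def sample_watermarked(probs, xi):
    """Pick the token i that maximizes xi[i] ** (1 / probs[i]).

    probs: next-token probabilities from the model (hypothetical input).
    xi:    one key vector with entries in [0, 1), taken from the key sequence.
    """
    probs = np.asarray(probs, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # Compare log(xi) / probs instead of xi ** (1 / probs) for stability.
    # Tokens with zero probability get a score of -inf and are never chosen.
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = np.where(probs > 0, np.log(xi) / probs, -np.inf)
    return int(np.argmax(scores))


def empirical_marginal(probs, n_draws=20000, seed=0):
    """Estimate the token frequencies when the key vector is uniformly random."""
    rng = np.random.default_rng(seed)  # assumption: the seed plays the role of the secret key
    counts = np.zeros(len(probs))
    for _ in range(n_draws):
        xi = rng.random(len(probs))
        counts[sample_watermarked(probs, xi)] += 1
    return counts / n_draws


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.2, 0.0]
    print(empirical_marginal(toy_probs))
```

For a fixed key vector the chosen token is deterministic, which is what a detector holding the key can test for. Averaged over random keys, the token frequencies match `probs`, which is the sense in which the scheme is distortion-free. An attack that recovers the key sequence, as the abstract describes, would remove exactly the secrecy this construction depends on.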

arXiv information

Authors	Shayleen Reynolds,Hengzhi He,Dung Daniel T. Ngo,Saheed Obitayo,Niccolò Dalmasso,Guang Cheng,Vamsi K. Potluru,Manuela Veloso
Published	2025-06-12 16:26:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
