UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models

Summary

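Large language models need iterative updates to handle problems such as knowledge conflicts and outdated information (for example, incorrect, private, or illegal content).
Machine unlearning offers a systematic methodology for removing targeted knowledge from trained models, so that the influence of sensitive information can be eliminated.
However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy against model ability, and they frequently cause catastrophic model collapse under extensive knowledge removal.
In-context unlearning, which relies solely on contextual prompting and leaves the model's intrinsic mechanisms unchanged, generalizes poorly and struggles to achieve true unlearning.
This work introduces UniErase, a novel unlearning paradigm that uses a learnable parametric suffix (an unlearning token) to steer language models toward targeted forgetting behaviors.
UniErase operates in two key phases: (i) an optimization stage that binds the desired unlearning outputs to the model's autoregressive probability distribution via token optimization, followed by (ii) a lightweight model editing phase that activates the learned token to probabilistically induce the specified forgetting objective.
As a new research direction in which a learned token induces the unlearning target, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under both fictitious and real-world knowledge settings.
Remarkably, on the TOFU benchmark, UniErase modifies only around 3.66% of the LLM parameters yet outperforms the previous forgetting SOTA baseline by around 4.01 times in model ability, with even better unlearning efficacy.
Similarly, while retaining more ability, UniErase surpasses the previous retaining SOTA by 35.96% in unlearning efficacy, showing top-tier performance on both fronts in the current unlearning domain.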
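
To make the first phase more concrete, the following is a minimal toy sketch of the general idea of optimizing a single learnable token against a frozen model so that a chosen "forgetting" output becomes likely. It is not the UniErase implementation: the tiny linear model `W`, the context vectors, the target index, and all function names are hypothetical and chosen only for illustration, and the model editing phase is not shown.

```python
import math

# Hypothetical frozen toy "model": 4 output entries, hidden size 3.
# Row 3 stands in for the desired forgetting response (an assumption).
W = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.8, 0.0],
    [0.2, 0.1, -0.6],
    [0.9, 0.4, 0.7],
]
TARGET = 3

# Made-up hidden vectors standing in for prompts about knowledge to forget.
CONTEXTS = [
    [0.2, -0.1, 0.3],
    [-0.4, 0.5, 0.1],
    [0.0, 0.3, -0.2],
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(context, token):
    """Output probabilities of the frozen toy model for context + token."""
    hidden = [c + t for c, t in zip(context, token)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W]
    return softmax(logits)


def mean_loss(token):
    """Average negative log-likelihood of the target output over all contexts."""
    losses = [-math.log(predict(c, token)[TARGET]) for c in CONTEXTS]
    return sum(losses) / len(losses)


def optimize_token(steps=500, lr=0.1):
    """Gradient descent on the token vector only; W is never modified."""
    token = [0.0] * len(W[0])
    for _ in range(steps):
        grad = [0.0] * len(token)
        for context in CONTEXTS:
            probs = predict(context, token)
            for k, row in enumerate(W):
                # Gradient of cross-entropy w.r.t. logit k is (p_k - onehot_k).
                err = probs[k] - (1.0 if k == TARGET else 0.0)
                for j, w in enumerate(row):
                    grad[j] += err * w / len(CONTEXTS)
        token = [t - lr * g for t, g in zip(token, grad)]
    return token


zero_token = [0.0, 0.0, 0.0]
learned_token = optimize_token()
print("mean loss without token:", mean_loss(zero_token))
print("mean loss with learned token:", mean_loss(learned_token))
```

In this sketch only the token vector receives gradient updates, which mirrors the abstract's description of binding an output to the model's probability distribution through token optimization rather than through fine-tuning the model itself.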

Summary (original)

Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model’s intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs a learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model’s autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce the specified forgetting objective. Serving as a new research direction for token learning to induce unlearning targets, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, on the TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms the previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy. Similarly, UniErase, maintaining more ability, also surpasses the previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performances in the current unlearning domain.

arXiv information

Authors	Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Ningyu Zhang, Kun Wang, Yang Wang
Published	2025-05-21 15:53:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
