Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

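Abstract

This work addresses large language model (LLM) unlearning: removing the influence of unwanted data and the associated model capabilities (for example, generation of copyrighted or harmful content) while preserving essential model utility, without retraining from scratch.
Despite the growing need for LLM unlearning, a principled optimization framework is still lacking.
To this end, the authors revisit the state-of-the-art approach, negative preference optimization (NPO), and identify a reference model bias that can undermine NPO's effectiveness, particularly when the forget data vary in unlearning difficulty.
Building on this, they propose a simple yet effective unlearning optimization framework called SimNPO, and show that the "simplicity" of removing the reliance on a reference model (viewed through the lens of simple preference optimization) benefits unlearning.
They also give deeper insight into SimNPO's advantages, supported by an analysis using mixtures of Markov chains.
Finally, extensive experiments validate SimNPO's superiority over existing unlearning baselines on benchmarks such as TOFU and MUSE, as well as its robustness against relearning attacks.
Code is available at https://github.com/OPTML-Group/Unlearn-Simple.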

Abstract (original)

In this work, we address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities (e.g., copyrighted data or harmful content generation) while preserving essential model utilities, without the need for retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework remains lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO’s effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that ‘simplicity’ in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO’s advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO’s superiority over existing unlearning baselines in benchmarks like TOFU and MUSE, and robustness against relearning attacks. Codes are available at https://github.com/OPTML-Group/Unlearn-Simple.
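The abstract does not state the objectives themselves, so the following is only a minimal sketch of the contrast it describes, assuming the loss forms commonly given for NPO and for a reference-free, length-normalized variant in the style of simple preference optimization. The function names, the default `beta` and `gamma` values, and the use of per-sequence log-probabilities as plain floats are all illustrative assumptions, not the authors' implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def npo_loss(logp_theta: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed NPO-style loss for one forget sequence.

    logp_theta: log-probability of the forget response under the current model.
    logp_ref:   log-probability under a frozen reference model.
    The loss depends on the log-ratio, so the reference model shapes the
    per-sample weighting.
    """
    return -(2.0 / beta) * log_sigmoid(-beta * (logp_theta - logp_ref))


def simnpo_loss(logp_theta: float, length: int,
                beta: float = 0.1, gamma: float = 0.0) -> float:
    """Assumed reference-free loss in the spirit of SimNPO.

    The reference term is dropped and the log-probability is normalized by
    the response length; gamma is a hypothetical margin parameter.
    """
    return -(2.0 / beta) * log_sigmoid(-(beta / length) * logp_theta - gamma)


if __name__ == "__main__":
    # Illustrative numbers only: a 5-token forget response.
    print(npo_loss(logp_theta=-10.0, logp_ref=-10.0))
    print(simnpo_loss(logp_theta=-10.0, length=5))
    print(simnpo_loss(logp_theta=-50.0, length=5))
```

Under these assumptions, the reference-free loss falls as the model assigns lower probability to the forget response, without consulting a second model; the NPO-style loss instead starts from the same value for every sample at initialization, because the log-ratio is zero whenever the two models agree.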

arXiv information

Authors	Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu
Published	2024-10-09 17:58:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
