A Closer Look at Machine Unlearning for Large Language Models

Abstract

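Large language models (LLMs) can memorize sensitive or copyrighted content, which raises privacy and legal concerns. Because retraining from scratch is costly, researchers have turned to machine unlearning to remove specific content from an LLM while preserving its overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and offer our insights on possible approaches. To address the inadequate evaluation of model outputs after unlearning, we introduce three additional metrics that assess token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods as untargeted or targeted and discuss the problems of each. Specifically, the behavior that untargeted unlearning tries to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To mitigate these issues, we propose a maximizing entropy (ME) objective for untargeted unlearning and add an answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios (fictitious unlearning, continual unlearning, and real-world unlearning) demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.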

Abstract (original)

Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.
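The abstract names a maximizing entropy (ME) objective for untargeted unlearning but does not spell out its form. The sketch below shows one way such an objective could be written, under the assumption that it is expressed as the mean KL divergence between the model's per-token predictive distribution and the uniform distribution over the vocabulary; minimizing that quantity is equivalent to maximizing entropy. The function names (`softmax`, `entropy`, `me_loss`) and the plain-list logit format are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def me_loss(token_logits: Sequence[Sequence[float]]) -> float:
    """Mean KL(p_t || uniform) over token positions (assumed form of ME).

    For a vocabulary of size V, KL(p || U) = log(V) - H(p), so driving
    this loss toward zero pushes each prediction toward maximum entropy.
    """
    total = 0.0
    for logits in token_logits:
        vocab_size = len(logits)
        total += math.log(vocab_size) - entropy(softmax(logits))
    return total / len(token_logits)


if __name__ == "__main__":
    # Toy logits for two answer tokens over a four-word vocabulary.
    confident = [[10.0, 0.0, 0.0, 0.0], [0.0, 8.0, 0.0, 0.0]]
    uncertain = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print("confident:", me_loss(confident))
    print("uncertain:", me_loss(uncertain))
```

With uniform logits the loss is zero up to floating-point error, and it grows as the predictions become more peaked, up to a maximum of log(V). In an actual training loop the same quantity would be computed on the forget-set answer tokens with a tensor library and combined with whatever retain-set regularizer is in use.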

arXiv information

Authors	Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, Min Lin
Published	2025-03-03 02:45:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
