SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

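Summary

Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even when the fine-tuning dataset is benign.
We address this challenge by proposing SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility.
It does so by selectively merging fine-tuned and safety-aligned model layers only when they deviate from safe behavior, as measured by a cosine similarity criterion.
We evaluate SafeMERGE against other fine-tuning-stage and post-fine-tuning-stage approaches for the Llama-2-7B-Chat and Qwen-2-7B-Instruct models on the GSM8K and PubMedQA tasks, while exploring different merging strategies.
We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, and sometimes even improves it.
The results suggest that our selective, subspace-guided, per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs, while outperforming simpler post-fine-tuning-stage defenses.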

Abstract (original)

Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even for benign fine-tuning datasets. We address this challenge by proposing SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility. It achieves this by selectively merging fine-tuned and safety-aligned model layers only when those deviate from safe behavior, measured by a cosine similarity criterion. We evaluate SafeMERGE against other fine-tuning- and post-fine-tuning-stage approaches for Llama-2-7B-Chat and Qwen-2-7B-Instruct models on GSM8K and PubMedQA tasks while exploring different merging strategies. We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, sometimes even enhancing it. The results suggest that our selective, subspace-guided, and per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs while outperforming simpler post-fine-tuning-stage defenses.
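The selective, per-layer merging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say exactly which quantities are compared or how layers are combined, so the sketch assumes that each layer is a flat list of numbers, that the cosine similarity is taken directly between the fine-tuned layer and the safety-aligned layer, and that merging is a simple linear interpolation. The names `cosine_similarity`, `selective_merge`, `threshold`, and `alpha` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def selective_merge(fine_tuned, safe, threshold=0.9, alpha=0.5):
    """Merge a fine-tuned layer with its safety-aligned counterpart only when
    their cosine similarity falls below `threshold` (assumed criterion).

    fine_tuned, safe: dicts mapping layer name -> list of floats.
    alpha: weight on the fine-tuned layer in the assumed linear merge.
    Returns (merged_layers, names_of_layers_that_were_merged).
    """
    merged = {}
    merged_names = []
    for name, ft_layer in fine_tuned.items():
        safe_layer = safe[name]
        if cosine_similarity(ft_layer, safe_layer) < threshold:
            # Layer deviates from the safe reference: interpolate toward it.
            merged[name] = [
                alpha * f + (1.0 - alpha) * s
                for f, s in zip(ft_layer, safe_layer)
            ]
            merged_names.append(name)
        else:
            # Layer is close enough to the safe reference: keep it unchanged.
            merged[name] = list(ft_layer)
    return merged, merged_names


# Toy example with two hypothetical "layers".
fine_tuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
safe = {"layer0": [1.0, 0.1], "layer1": [1.0, 0.0]}
merged, merged_names = selective_merge(fine_tuned, safe)
print(merged_names)
```

In this toy example only `layer1` is merged, because it is orthogonal to its safe counterpart, while `layer0` is left as fine-tuned. Leaving well-aligned layers untouched is what would let such a scheme keep most of the task utility gained by fine-tuning.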

arXiv information

Authors	Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
Published	2025-03-21 15:44:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
