Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning

Abstract

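The finetuning-as-a-service paradigm opens a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the finetuning process into producing a model whose alignment is broken.
We conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, which points to a probable cause of this alignment-breaking effect.
Motivated by these findings, we propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user finetuning.
The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbations to them during the alignment phase.
This allows the embeddings to withstand harmful perturbations from unsanitized user data in the finetuning phase.
Our results on mainstream open-source LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine improves the robustness of alignment against the embedding drift induced by harmful prompts while preserving reasoning ability on benign prompts.
Our code is available at \url{https://github.com/git-disl/Vaccine}.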
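
The core idea in the abstract (perturbing hidden embeddings during alignment so that they stay stable later) can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the tiny linear model, the normalized-gradient perturbation, the radius `rho`, the fixed (non-scheduled) perturbation size, and all function names are assumptions made for this example. The actual method is in the linked repository.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss_and_grads(W, v, x, y, eps=None):
    """Toy model: hidden embedding e = W x, prediction v . e, squared-error loss.

    If eps is given, it is added to the hidden embedding as a constant offset.
    """
    e = [dot(row, x) for row in W]
    if eps is not None:
        e = [ei + di for ei, di in zip(e, eps)]
    err = dot(v, e) - y
    loss = err ** 2
    grad_e = [2 * err * vi for vi in v]              # dL/de
    grad_v = [2 * err * ei for ei in e]              # dL/dv
    grad_W = [[g * xj for xj in x] for g in grad_e]  # dL/dW
    return loss, grad_e, grad_v, grad_W


def crafted_perturbation(grad_e, rho):
    """Assumed form: a step of length rho along the embedding gradient."""
    norm = math.sqrt(sum(g * g for g in grad_e))
    if norm == 0.0:
        return [0.0] * len(grad_e)
    return [rho * g / norm for g in grad_e]


def perturbation_aware_step(W, v, x, y, rho, lr):
    # Pass 1: clean forward/backward to find the loss-increasing direction
    # in embedding space.
    _, grad_e, _, _ = loss_and_grads(W, v, x, y)
    eps = crafted_perturbation(grad_e, rho)
    # Pass 2: recompute the loss with the perturbed embedding and update the
    # parameters against that perturbed loss.
    loss, _, grad_v, grad_W = loss_and_grads(W, v, x, y, eps)
    new_W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    new_v = [vi - lr * g for vi, g in zip(v, grad_v)]
    return new_W, new_v, loss
```

Calling `perturbation_aware_step` repeatedly on alignment examples trains the toy model against a worst-case shift of its embedding within radius `rho`, which is one hedged reading of "perturbation-aware alignment". How the perturbation size is scheduled over training is not specified in the abstract and is left fixed here.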

Abstract (original)

The new paradigm of finetuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a few harmful data uploaded by users can easily trick the finetuning to produce an alignment-broken model. We conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, showing a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of users finetuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. Our results on open source mainstream LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. Our code is available at \url{https://github.com/git-disl/Vaccine}.

arXiv information

Authors	Tiansheng Huang, Sihao Hu, Ling Liu
Published	2024-08-22 04:29:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
