Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Abstract

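Recent research shows that the emerging fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning on a small amount of harmful data uploaded by users can compromise a model's safety alignment.
This attack, known as harmful fine-tuning, has attracted broad research interest in the community.
However, because the attack is still new, \textbf{we have found from our miserable submission experience that there are general misunderstandings within the research community.} In this paper, we aim to clear up some common concerns about the attack setting and to formally establish the research problem.
Specifically, we first present the threat model of the problem and introduce the harmful fine-tuning attack and its variants.
We then systematically survey the existing literature on attacks, defenses, and mechanical analysis of the problem.
Finally, we outline future research directions that might contribute to the development of the field.
In addition, we present a list of questions of interest, which may be useful to refer to when reviewers in the peer review process question the realism of an experiment, attack, or defense setting.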
A curated list of relevant papers is maintained and available at \url{https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers}.

Abstract (original)

Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns — fine-tuning over a few harmful data uploaded by the users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning, has raised a broad research interest among the community. However, as the attack is still new, \textbf{we observe from our miserable submission experience that there are general misunderstandings within the research community.} We in this paper aim to clear some common concerns for the attack setting, and formally establish the research problem. Specifically, we first present the threat model of the problem, and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks/defenses/mechanical analysis of the problem. Finally, we outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to refer to when reviewers in the peer review process question the realism of the experiment/attack/defense setting. A curated list of relevant papers is maintained and made accessible at: \url{https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers}.

arXiv information

Authors	Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Published	2024-09-30 16:29:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

