Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

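Abstract

Large language models (LLMs) must undergo safety alignment to ensure safe conversations with humans. However, this paper introduces an inference-time attack and demonstrates that safety alignment can be easily reversed to produce a harmful language model without additional training. Specifically, the reversal is achieved by contrasting the output token distribution of a safety-aligned language model (e.g., Llama-2-chat) with that of its pre-trained version (e.g., Llama-2), so that token predictions shift in the direction opposite to alignment. We name this method emulated disalignment (ED) because it uses pure sampling to provably emulate (approximate) the result of fine-tuning the pre-trained model to minimize a safety reward. Experiments with ED across three evaluation datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate by a large margin in 43 of 48 evaluation subsets. Finally, because ED requires access to a language model's output token distributions, which puts open-source models at particular risk, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.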

Abstract (original)

Large language models (LLMs) need to undergo safety alignment to ensure safe conversations with humans. However, this paper introduces an inference-time attack method, demonstrating that safety alignment can be easily reversed to produce harmful language models without additional training. Specifically, this reversal is achieved by contrasting the output token distribution of a safety-aligned language model (e.g., Llama-2-chat) against its pre-trained version (e.g., Llama-2) so that the token predictions are shifted towards the opposite direction of alignment. We name this method emulated disalignment (ED) because it uses pure sampling to provably emulate (or ‘approximate’) the result of fine-tuning the pre-trained model to minimize a safety reward. Our experiments with ED across three evaluation datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. Eventually, given ED’s need for language model output token distributions, which particularly compromises open-source models, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.

arXiv information

Authors	Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Published	2024-04-03 12:25:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
