Language Models Resist Alignment: Evidence From Data Compression

Summary

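Large language models (LLMs) may exhibit unintended or undesirable behaviors.
Recent work has focused on aligning LLMs to mitigate harmful outputs.
Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally.
Does alignment fine-tuning have robust effects on models, or are its effects merely superficial?
In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives.
Empirically, we demonstrate the $\mathbf{elasticity}$ of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning.
Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude.
We validate the presence of elasticity through experiments on models of varying types and scales.
Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly.
Furthermore, we show that elasticity correlates positively with increased model size and expanded pre-training data.
Our findings underscore the need to address the inherent elasticity of LLMs in order to mitigate their resistance to alignment.
The model weights and code are available at pku-lm-resist-alignment.github.io.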

Abstract (original)

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning yield have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the $\mathbf{elasticity}$ of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we further reveal that elasticity positively correlates with the increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weight and code are available at pku-lm-resist-alignment.github.io.

arXiv information

Authors	Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang
Published	2025-06-11 17:23:47+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
