Language Models Resist Alignment: Evidence From Data Compression

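Summary

Large language models (LLMs) may exhibit unintended or undesirable behaviors.
Recent work has focused on aligning LLMs to mitigate harmful outputs.
Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally.
Does alignment fine-tuning have a robust effect on models, or is its impact merely superficial?
This work is the first to explore this phenomenon from both theoretical and empirical perspectives.
Empirically, the authors demonstrate the elasticity of post-alignment models, that is, their tendency to revert under further fine-tuning to the behavior distribution formed during pre-training.
Leveraging compression theory, they formally deduce that fine-tuning undermines alignment disproportionately relative to pre-training, potentially by orders of magnitude.
Experiments on models of varying types and scales validate the presence of elasticity.
Specifically, model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly.
The authors further show that elasticity correlates positively with increased model size and with the expansion of pre-training data.
The findings underscore the need to address the inherent elasticity of LLMs in order to mitigate their resistance to alignment.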

Summary (original)

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
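
The compression framing in the abstract builds on a standard correspondence between prediction and lossless coding: an ideal entropy coder spends about -log2 p(x) bits on a symbol the model predicts with probability p(x), so a model that fits the data better also compresses it better. The sketch below only illustrates that correspondence and is not the paper's method. The character-level unigram model and the function names `unigram_model` and `code_length_bits` are hypothetical stand-ins chosen for this example.

```python
import math
from collections import Counter


def unigram_model(text: str) -> dict[str, float]:
    """Fit a character-level unigram model (a toy stand-in for a language model)."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def code_length_bits(model: dict[str, float], text: str) -> float:
    """Ideal code length of `text` under `model`: the sum of -log2 p(symbol).

    Assumes every character of `text` has nonzero probability under `model`.
    """
    return sum(-math.log2(model[ch]) for ch in text)


if __name__ == "__main__":
    data = "aaab"
    fitted = unigram_model(data)      # p(a) = 0.75, p(b) = 0.25
    uniform = {"a": 0.5, "b": 0.5}    # baseline that knows nothing about the data
    print(f"uniform model: {code_length_bits(uniform, data):.3f} bits")
    print(f"fitted model:  {code_length_bits(fitted, data):.3f} bits")
```

Under this view, comparing how many bits a model needs for different datasets is one way to quantify how strongly each dataset has shaped it, which is the kind of quantity a compression-based analysis reasons about.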

arXiv information

Authors	Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Josef Dai, Yunhuai Liu, Yaodong Yang
Published	2024-12-20 16:25:12+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
