Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

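Summary

Direct Preference Optimization (DPO) and its variants are increasingly used to align language models with human preferences.
These methods are designed to teach a model to generate preferred responses more often than dispreferred ones, yet prior work has observed that the likelihood of preferred responses often decreases during training.
This work sheds light on the causes and implications of this counterintuitive phenomenon, which we call likelihood displacement.
We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with the opposite meaning.
As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$.
Moreover, when aligning a model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%).
We theoretically characterize likelihood displacement as being driven by preferences that induce similar embeddings, as measured by the centered hidden embedding similarity (CHES) score.
Empirically, the CHES score makes it possible to identify which training samples contribute most to likelihood displacement in a given dataset.
Filtering out these samples effectively mitigated unintentional unalignment in our experiments.
More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, a task for which we believe the CHES score may prove valuable.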
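
One way to see how the likelihood of a preferred response can fall while training still "succeeds" is to note that the standard DPO loss depends only on the margin between the preferred and dispreferred log-probability ratios, not on either likelihood by itself. The following is a minimal sketch under that assumption; the function name `dpo_loss`, the value `beta=0.1`, and all log-probabilities are hypothetical illustrations and are not taken from the paper.

```python
import math


def dpo_loss(logp_pref, logp_disp, ref_logp_pref, ref_logp_disp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model. `beta` is an assumed value.
    """
    # Margin between the preferred and dispreferred log-ratios.
    margin = (logp_pref - ref_logp_pref) - (logp_disp - ref_logp_disp)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities (illustrative only, not from the paper).
ref_pref, ref_disp = -2.0, -2.5

# At initialization the policy equals the reference, so the margin is zero.
before = dpo_loss(-2.0, -2.5, ref_pref, ref_disp)

# Later, BOTH responses are less likely than before, but the dispreferred
# one dropped more, so the margin grew and the loss went down.
after = dpo_loss(-3.0, -6.0, ref_pref, ref_disp)

print(f"loss before: {before:.4f}, loss after: {after:.4f}")
```

In this toy setting the loss improves even though the preferred log-probability fell from -2.0 to -3.0. The probability mass that left both responses has to go somewhere, which is the displacement the summary describes; the loss alone does not say where.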

Summary (original)

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.

arXiv information

Authors	Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
Published	2025-02-06 14:49:59+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
