From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

Summary

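The alignment process changes several properties of a large language model's (LLM's) output distribution.
We analyze two aspects of the post-alignment distributional shift in LLM responses.
First, we re-examine previously reported reductions in response diversity after alignment.
Our analysis suggests that the apparent drop in response diversity is largely explained by quality control and information aggregation.
Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response.
Finding little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models?
Our second investigation shows that this is not the case: the behavior of aligned models is recoverable from base models without fine-tuning.
A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other.
Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis.
They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning.
Our code and data are available at https://github.com/thomlake/investigating-alignment.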

Abstract (original)

The alignment process changes several properties of a large language model’s (LLM’s) output distribution. We analyze two aspects of post-alignment distributional shift of LLM responses. First, we re-examine previously reported reductions in response diversity post-alignment. Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation. Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response. Finding little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models? Our second investigation shows this is not the case and the behavior of aligned models is recoverable from base models without fine-tuning. A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other. Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis. They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning. Our code and data is available at https://github.com/thomlake/investigating-alignment.

arXiv information

Authors	Thom Lake, Eunsol Choi, Greg Durrett
Published	2025-05-12 16:11:53+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
