Tradeoffs Between Alignment and Helpfulness in Language Models

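Summary

Language model alignment has become an important component of AI safety: by reinforcing desired behaviors and suppressing undesired ones, it enables safe interaction between humans and language models.
It is typically done by tuning the model or by inserting preset alignment prompts.
Recently, representation engineering, a method that changes a model's behavior by modifying its representations after training, was shown to be effective for aligning LLMs (Zou et al., 2023a).
Representation engineering improves performance on alignment-oriented tasks such as resistance to adversarial attacks and reduction of social biases, but it has also been shown to degrade the model's ability to perform basic tasks.
This paper studies the tradeoff between the gain in alignment and the loss in helpfulness of the model.
We propose a theoretical framework that provides bounds for these two quantities and demonstrate their relevance empirically.
Interestingly, we find that although helpfulness generally decreases, it does so quadratically in the norm of the representation engineering vector, whereas alignment increases linearly in that norm, which indicates a regime in which representation engineering is efficient to use.
We validate these findings empirically and chart the limits of the usefulness of representation engineering for alignment.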

Summary (original)

Language model alignment has become an important component of AI safety, allowing safe interactions between humans and language models, by enhancing desired behaviors and inhibiting undesired ones. It is often done by tuning the model or inserting preset aligning prompts. Recently, representation engineering, a method which alters the model’s behavior via changing its representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains in alignment oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to cause a decrease in the ability of the model to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and decrease in helpfulness of the model. We propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. Interestingly, we find that while the helpfulness generally decreases, it does so quadratically with the norm of the representation engineering vector, while the alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. We validate our findings empirically, and chart the boundaries to the usefulness of representation engineering for alignment.
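
The linear-versus-quadratic scaling described in the abstract can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the coefficients `a` and `b` are hypothetical, the two quantities are treated as if they shared a common unit, and none of the numbers come from the paper. It shows why a linear gain combined with a quadratic cost implies a range of small norms in which the gain dominates.

```python
# Toy illustration only: `a` and `b` are hypothetical coefficients,
# not values reported in the paper.

def alignment_gain(r: float, a: float = 1.0) -> float:
    """Alignment improvement, assumed linear in the steering-vector norm r."""
    return a * r


def helpfulness_loss(r: float, b: float = 0.5) -> float:
    """Helpfulness degradation, assumed quadratic in the norm r."""
    return b * r ** 2


def net_benefit(r: float, a: float = 1.0, b: float = 0.5) -> float:
    """Gain minus loss, assuming both are measured on a comparable scale."""
    return alignment_gain(r, a) - helpfulness_loss(r, b)


def break_even_norm(a: float = 1.0, b: float = 0.5) -> float:
    """Norm at which the quadratic loss equals the linear gain (a*r = b*r^2)."""
    return a / b


def best_norm(a: float = 1.0, b: float = 0.5) -> float:
    """Norm that maximizes a*r - b*r^2, found by setting the derivative to zero."""
    return a / (2 * b)


if __name__ == "__main__":
    for r in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f"r={r:.1f}  gain={alignment_gain(r):.2f}  "
              f"loss={helpfulness_loss(r):.2f}  net={net_benefit(r):.2f}")
```

Under these assumed coefficients the net benefit is positive for norms below `a / b` and negative beyond it, which is the kind of "efficient regime" the abstract refers to. The actual bounds and constants are those derived in the paper, not the ones used here.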

arXiv information

Authors	Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
Published	2024-01-29 17:38:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
