ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation

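Summary

Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet alignment training remains computationally expensive.
To reduce the computational cost of alignment training, especially for LLMs with a very large number of parameters, and to reuse value alignment that has already been learned, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation.
From the perspective of representation engineering, ConTrans first refines concept vectors for value alignment from a source LLM (usually a weak but aligned LLM).
The refined concept vectors are then reformulated via an affine transformation so that they fit the target LLM (usually a strong but unaligned base LLM).
In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM.
Experiments demonstrate successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models, across multiple LLMs and LLM families.
Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness.
The experimental results validate the effectiveness of concept transplantation both across LLM families and within a single LLM family.
Our work demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.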
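
The three steps in the summary (refine a concept vector in the source model, reformulate it with an affine transformation, transplant it into the target model's residual stream) can be illustrated with a small NumPy sketch on synthetic activations. This is not the authors' implementation. The abstract does not say how the concept vectors are refined or how the affine map is fitted, so the sketch assumes a mean-difference concept vector and an ordinary least-squares fit, and every function name (concept_vector, fit_affine, reformulate, transplant, demo) is hypothetical.

```python
import numpy as np


def concept_vector(pos_hidden, neg_hidden):
    """Assumed refinement step: difference between the mean activation on
    examples that express the concept and on examples that do not."""
    return pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)


def fit_affine(src_hidden, tgt_hidden):
    """Assumed fitting step: least-squares solution of
    tgt_hidden ~= src_hidden @ weight + bias, using paired activations."""
    ones = np.ones((src_hidden.shape[0], 1))
    design = np.hstack([src_hidden, ones])  # append a column for the bias
    solution, *_ = np.linalg.lstsq(design, tgt_hidden, rcond=None)
    return solution[:-1], solution[-1]  # weight (d_src, d_tgt), bias (d_tgt,)


def reformulate(vector, weight):
    """Map a source-space direction into the target space. The vector is a
    difference of two means, so the bias cancels and only weight applies."""
    return vector @ weight


def transplant(residual, vector, alpha=1.0):
    """Add the scaled concept vector to every position of a residual-stream
    activation of shape (seq_len, d_tgt). alpha is an assumed strength knob."""
    return residual + alpha * vector


def demo(seed=0):
    """Run the three steps on synthetic data standing in for real activations."""
    rng = np.random.default_rng(seed)
    d_src, d_tgt, n = 4, 6, 64
    true_w = rng.normal(size=(d_src, d_tgt))
    true_b = rng.normal(size=d_tgt)
    src_pos = rng.normal(loc=1.0, size=(n, d_src))   # concept present
    src_neg = rng.normal(loc=-1.0, size=(n, d_src))  # concept absent
    src_all = np.vstack([src_pos, src_neg])
    tgt_all = src_all @ true_w + true_b              # paired target activations
    weight, bias = fit_affine(src_all, tgt_all)
    v_src = concept_vector(src_pos, src_neg)
    v_tgt = reformulate(v_src, weight)
    residual = rng.normal(size=(5, d_tgt))
    steered = transplant(residual, v_tgt, alpha=2.0)
    return weight, bias, true_w, true_b, v_tgt, residual, steered
```

In a real setting the synthetic arrays would be replaced by hidden states collected from the source and target models on the same inputs, and the choice of layer and of alpha would need tuning. Neither is specified in the abstract.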

Abstract (original)

Ensuring large language models (LLM) behave consistently with human goals, values, and intentions is crucial for their safety but yet computationally expensive. To reduce the computational cost of alignment training of LLMs, especially for those with a huge number of parameters, and to reutilize learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans refines concept vectors in value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via affine transformation. In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. Experiment results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work successfully demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.

arXiv information

Authors	Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
Published	2024-12-30 07:25:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
