Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

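Summary

Interest is growing in training domain-expert LLMs that outperform their general-purpose instruction-tuned counterparts in specific technical fields.
However, these specialized models often lose safety capabilities in the process and become able to generate harmful content.
As a solution, we introduce MergeAlign, an efficient and effective merging-based alignment method that interpolates the domain and alignment vectors to create safer domain-specific models while preserving their utility.
We apply MergeAlign to Llama3 variants that are experts in medicine and finance, and obtain substantial alignment improvements with minimal to no degradation on domain-specific benchmarks.
We study the impact of model merging through model similarity metrics and the contributions of the individual models being merged.
We hope that our findings open new research avenues and encourage more efficient development of safe expert LLMs.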

Summary (original)

There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
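
The abstract says the method interpolates a domain vector and an alignment vector but does not give the formula. The sketch below is one possible reading in the task-arithmetic style, not the authors' implementation: it assumes each vector is the difference between a fine-tuned model's weights and a shared base model's weights, and that the merged model is the base plus a convex combination of the two. The names `task_vector`, `interpolate_merge`, and the weight `alpha` are hypothetical, and plain Python lists stand in for real parameter tensors.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Assumed definition: fine-tuned weights minus base weights, per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def interpolate_merge(
    base: Weights, domain: Weights, aligned: Weights, alpha: float = 0.5
) -> Weights:
    """Return base + alpha * domain_vector + (1 - alpha) * alignment_vector.

    alpha is a hypothetical interpolation weight in [0, 1]; the abstract
    does not state how the two vectors are weighted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    tau_domain = task_vector(domain, base)
    tau_align = task_vector(aligned, base)
    return {
        name: [
            b + alpha * d + (1.0 - alpha) * a
            for b, d, a in zip(base[name], tau_domain[name], tau_align[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    base = {"w": [0.0, 1.0]}
    domain = {"w": [2.0, 1.0]}   # hypothetical domain-expert checkpoint
    aligned = {"w": [0.0, 3.0]}  # hypothetical safety-aligned checkpoint
    print(interpolate_merge(base, domain, aligned, alpha=0.5))
```

Under this assumption, `alpha=1.0` recovers the domain expert and `alpha=0.0` recovers the aligned model, so intermediate values trade domain knowledge against safety behavior. All three checkpoints must share the same architecture and base model for the subtraction to be meaningful.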

arXiv information

Authors	Megh Thakkar,Quentin Fournier,Matthew Riemer,Pin-Yu Chen,Amal Zouaq,Payel Das,Sarath Chandar
Published	2025-05-30 17:17:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
