Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing

Abstract

Safety alignment is an essential research topic for real-world AI applications. Despite the multifaceted nature of safety and trustworthiness in AI, current safety alignment methods often focus on a comprehensive notion of safety. By carefully assessing models from the existing safety-alignment methods, we found that, while they generally improved overall safety performance, they failed to ensure safety in specific categories. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model’s helpfulness. We observed that, while smaller KL penalty parameters, increased training iterations, and dataset cleansing can enhance safety, they do not necessarily improve the trade-off between safety and helpfulness. We discovered that safety alignment could even induce undesired effects and result in a model that prefers generating negative tokens leading to rejective responses, regardless of the input context. To address this, we introduced a learning-free method, Token-level Safety-Debiased Inference (TSDI), to estimate and correct this bias during the generation process using randomly constructed prompts. Our experiments demonstrated that our method could enhance the model’s helpfulness while maintaining safety, thus improving the trade-off Pareto-front.
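
The abstract only states that TSDI estimates a token-level bias from randomly constructed prompts and corrects it during generation; it does not give the estimator. The toy sketch below shows one possible reading, under loudly stated assumptions: the bias is taken to be the average next-token logit gap between an aligned model and a reference model over random prompts, and the correction subtracts that gap at decoding time. All names (`estimate_token_bias`, `debias`, `aligned_logits`, `reference_logits`, `REFUSAL_TOKEN`) and the toy models are hypothetical and are not taken from the paper.

```python
import numpy as np

VOCAB = 8          # toy vocabulary size (assumption)
REFUSAL_TOKEN = 3  # hypothetical "negative" token id the aligned model over-prefers


def reference_logits(prompt):
    """Toy stand-in for a reference model: logits depend only on the prompt."""
    seed = int(np.sum(prompt))
    return np.random.default_rng(seed).normal(size=VOCAB)


def aligned_logits(prompt):
    """Toy stand-in for an aligned model with a context-independent refusal bias."""
    out = reference_logits(prompt).copy()
    out[REFUSAL_TOKEN] += 2.0
    return out


def random_prompts(n_prompts, prompt_len, vocab_size, rng):
    """Randomly constructed prompts: uniformly sampled token ids."""
    return rng.integers(0, vocab_size, size=(n_prompts, prompt_len))


def estimate_token_bias(aligned_fn, reference_fn, prompts):
    """Assumed estimator: mean logit gap between the two models over random prompts."""
    diffs = [aligned_fn(p) - reference_fn(p) for p in prompts]
    return np.mean(diffs, axis=0)


def debias(logits, bias, strength=1.0):
    """Subtract the estimated bias from next-token logits before sampling."""
    return logits - strength * bias


rng = np.random.default_rng(0)
prompts = random_prompts(64, 5, VOCAB, rng)
bias = estimate_token_bias(aligned_logits, reference_logits, prompts)

test_prompt = np.array([1, 2, 3, 4, 5])
corrected = debias(aligned_logits(test_prompt), bias)
```

In this toy setting the estimated bias concentrates on the single over-preferred token, and the corrected logits match the reference model's. A real implementation would apply the correction inside the decoding loop of an actual language model, and the `strength` parameter here is an illustrative knob rather than something the abstract describes.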

arXiv information

Authors	Thien Q. Tran, Akifumi Wachi, Rei Sato, Takumi Tanabe, Youhei Akimoto
Published	2025-02-04 09:31:54+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
