Internal Activation as the Polar Star for Steering Unsafe LLM Behavior

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks but also pose significant risks due to their potential to generate harmful content. Although existing safety mechanisms can improve model safety, they often lead to overly cautious behavior and fail to fully utilize LLMs’ internal cognitive processes. Drawing inspiration from cognitive science, where humans rely on reflective reasoning (System 2 thinking) to regulate language and behavior, we empirically demonstrate that LLMs also possess a similar capacity for internal assessment and regulation, which can be actively detected. Building on this insight, we introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model’s internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility. Compared to traditional safety alignment methods, SafeSwitch delivers more informative and context-aware refusals, demonstrates resilience to unseen queries, and achieves these benefits while tuning less than 6% of the original parameters. These features make SafeSwitch a promising approach for implementing nuanced safety controls in LLMs.

arXiv information

Authors	Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Denghui Zhang, Heng Ji
Published	2025-02-04 16:47:38+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
