Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts

Summary

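DeepSeek-R1, known for its exceptional reasoning capabilities and open-source strategy, is having a major impact on the global artificial intelligence landscape.
However, it shows notable safety shortcomings.
A recent study by Robust Intelligence, a Cisco subsidiary, conducted in collaboration with the University of Pennsylvania, found that DeepSeek-R1 exhibits a 100% attack success rate when processing harmful prompts.
In addition, multiple security companies and research institutions have identified critical security vulnerabilities in the model.
Although China Unicom has revealed safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated.
To address this gap, this study uses the comprehensive Chinese safety benchmark CHiSafetyBench to carry out an in-depth safety evaluation of the DeepSeek-R1 series distilled models.
The aim is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further clarify the adverse effects of distillation on model safety.
Based on these findings, we implement targeted safety enhancements across the entire DeepSeek-R1 model series.
Evaluation results show that the enhanced models achieve significant safety improvements while maintaining their reasoning capabilities without notable degradation.
We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe as a valuable resource for future research and optimization of DeepSeek models.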

Abstract (original)

DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.

arXiv information

Authors	Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Limin Han, Jiaojiao Zhao, Junting Guo, Zhenhong Long, Shu Yang, Meijuan An, Beibei Huang, Rongjia Du, Ning Wang, Kai Wang, Shiguo Lian
Published	2025-05-16 13:29:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
