SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

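Summary

Emerging large reasoning models (LRMs), such as the DeepSeek-R1 models, use long chain-of-thought (CoT) reasoning to generate structured intermediate steps, which strengthens their reasoning capabilities.
However, long CoT does not inherently guarantee safe outputs and can lead to harmful results, such as introducing security vulnerabilities into code or spreading misinformation.
Current research on large language model (LLM) safety usually focuses on short-answer responses and overlooks the long CoT-style outputs of LRMs.
To bridge this gap, the authors conduct a systematic study of LRM safety.
First, they investigate safety evaluators calibrated against human annotations.
Using newly developed metrics, they thoroughly assess the safety of 12 state-of-the-art LRMs on the StrongReject and WildJailbreak datasets.
The results show that the safety of LRMs has not kept pace with their advances in reasoning.
They further perform a fine-grained analysis of the reasoning trace and the final answer.
They find that three decoding strategies, ZeroThink, LessThink, and MoreThink, can improve model safety without additional training.
However, these strategies either rely on constrained reasoning traces or incur high inference costs.
To better strengthen LRM safety, they introduce SafeChain, the first safety training dataset in CoT style.
Fine-tuning two LRMs with SafeChain shows that it not only improves model safety but also preserves performance across 6 reasoning benchmarks.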

Abstract (original)

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

arXiv information

Authors	Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
Published	2025-02-17 16:57:56+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
