GuardReasoner: Towards Reasoning-based LLM Safeguards

Summary

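As LLMs increasingly impact safety-critical applications, ensuring their safety with guardrails remains a key challenge.
This paper proposes GuardReasoner, a new safeguard for LLMs that guides the guard model to learn to reason.
Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps.
Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In addition, we present hard sample DPO to further strengthen their reasoning ability.
In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority.
Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% in average F1 score.
We release the training data, code, and models of GuardReasoner at different scales (1B, 3B, 8B): https://github.com/yueliu1999/GuardReasoner/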

Summary (original)

As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.
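The headline comparison above is reported as an average F1 score over benchmarks. As a rough illustration of that metric only, the sketch below computes F1 for a binary guard classifier and takes an unweighted mean across benchmarks. The label names ("harmful", "unharmful"), the helper names `f1_score` and `mean_f1`, the toy data, and the unweighted averaging are all assumptions made for this example; the abstract does not specify how the authors aggregate scores.

```python
def f1_score(y_true, y_pred, positive="harmful"):
    """F1 of the positive class for paired lists of gold and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(per_benchmark):
    """Unweighted mean of F1 over benchmarks (an assumed aggregation scheme).

    per_benchmark maps a benchmark name to a (y_true, y_pred) tuple.
    """
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in per_benchmark.values()]
    return sum(scores) / len(scores)


# Toy data with hypothetical benchmark names, not results from the paper.
toy_results = {
    "benchmark_a": (
        ["harmful", "harmful", "unharmful", "unharmful"],
        ["harmful", "unharmful", "harmful", "unharmful"],
    ),
    "benchmark_b": (
        ["harmful", "unharmful"],
        ["harmful", "unharmful"],
    ),
}
average = mean_f1(toy_results)
```

A sample-weighted mean, where larger benchmarks count for more, would be an equally plausible aggregation; the choice changes the reported average when benchmark sizes differ.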

arXiv information

Authors	Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
Published	2025-01-30 17:06:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
