Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models

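Abstract

Reasoning-based language models have shown strong performance across many domains, with the largest gains in mathematical and coding tasks.
Recent work shows that reasoning also brings significant benefits to LLM safety and guardrail applications.
This work presents a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalizing to custom safety policies at inference time.
The study focuses on two key dimensions: data efficiency and inference efficiency.
On the data side, reasoning-based models show strong sample efficiency, reaching competitive performance with significantly fewer training examples than their non-reasoning counterparts.
This makes it possible to repurpose the remaining data for mining high-value, difficult samples that further improve model performance.
On the inference side, the study evaluates practical trade-offs by introducing reasoning budgets, examining how reasoning length affects latency and accuracy, and exploring dual-mode training that allows runtime control over reasoning behavior.
The findings offer practical insights for researchers and developers who want to train and deploy reasoning-based guardrail models effectively and efficiently in real-world systems.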
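
The abstract names reasoning budgets and runtime control over reasoning but does not say how either is implemented. The sketch below is one plausible reading, not the authors' method: a budget is enforced by keeping only the first `budget` reasoning tokens and closing the reasoning span, and the mode is selected with a flag in the prompt. The `<think>` delimiters, the `[reasoning: on/off]` flag, and the function names `build_prompt` and `cap_reasoning` are all hypothetical.

```python
from typing import List

# Hypothetical delimiters; the abstract does not specify any output format.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def build_prompt(policy: str, content: str, reasoning: bool) -> str:
    """Assemble a moderation prompt with a hypothetical runtime mode flag."""
    mode = "reasoning: on" if reasoning else "reasoning: off"
    return f"[{mode}]\nPolicy:\n{policy}\n\nContent:\n{content}\n"


def cap_reasoning(tokens: List[str], budget: int) -> List[str]:
    """Keep at most `budget` reasoning tokens, then close the reasoning span."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return [THINK_OPEN, *tokens[:budget], THINK_CLOSE]


if __name__ == "__main__":
    # Whitespace-separated words stand in for model tokens in this toy example.
    trace = "the request asks for instructions that the policy prohibits".split()
    print(" ".join(cap_reasoning(trace, budget=4)))
    # prints: <think> the request asks for </think>
```

A smaller budget shortens the text the model must generate before its verdict, which is where the latency and accuracy trade-off mentioned in the abstract would show up; a budget of 0 yields an empty reasoning span.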

Abstract (original)

Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrails models in real-world systems.

arXiv information

Authors	Makesh Narsimhan Sreedhar, Traian Rebedea, Christopher Parisien
Published	2025-05-26 15:01:37+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
