SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

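Summary

An ideal AI safety moderation system would be both structurally interpretable (so that its decisions can be reliably explained) and steerable (so that it can be aligned to safety standards and reflect a community's values); current systems fall short on both counts.
To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by building a structured "harm-benefit tree", which enumerates the harmful and beneficial actions and effects the behavior may lead to, together with likelihood, severity, and immediacy labels that describe the potential impact on any stakeholders.
SafetyAnalyst then aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters, which can be aligned to particular safety preferences.
We applied this conceptual framework to develop, test, and release an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts.
On a comprehensive set of prompt safety benchmarks, SafetyReporter (average F1 = 0.81) outperforms existing LLM safety moderation systems (average F1 < 0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.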
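
The aggregation step described in the summary can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Effect` record, the label vocabularies, the numeric weights, the multiplicative combination of labels, and the `harm_scale` / `benefit_scale` steering parameters. The sketch only shows the general idea of summing labeled harmful and beneficial effects into one score with weights that can be read and adjusted.

```python
from dataclasses import dataclass

# Hypothetical label-to-weight tables (illustrative values, not the paper's).
LIKELIHOOD = {"low": 0.25, "medium": 0.5, "high": 1.0}
SEVERITY = {"minor": 0.25, "significant": 0.5, "major": 1.0}
IMMEDIACY = {"downstream": 0.5, "immediate": 1.0}


@dataclass
class Effect:
    """One leaf of a simplified harm-benefit tree (hypothetical schema)."""
    stakeholder: str
    description: str
    harmful: bool      # True for a harmful effect, False for a beneficial one
    likelihood: str
    severity: str
    immediacy: str


def effect_contribution(e, harm_scale=1.0, benefit_scale=1.0):
    """Signed contribution of one effect: harms add, benefits subtract."""
    magnitude = LIKELIHOOD[e.likelihood] * SEVERITY[e.severity] * IMMEDIACY[e.immediacy]
    return harm_scale * magnitude if e.harmful else -benefit_scale * magnitude


def harmfulness(effects, harm_scale=1.0, benefit_scale=1.0):
    """Sum all signed contributions into a single harmfulness score."""
    return sum(effect_contribution(e, harm_scale, benefit_scale) for e in effects)


def is_harmful(effects, harm_scale=1.0, benefit_scale=1.0, threshold=0.0):
    """Classify as harmful when the score exceeds the threshold."""
    return harmfulness(effects, harm_scale, benefit_scale) > threshold


# A toy tree with one beneficial and one harmful effect.
tree = [
    Effect("prompt author", "obtains the requested information",
           harmful=False, likelihood="high", severity="minor", immediacy="immediate"),
    Effect("third parties", "could be put at risk",
           harmful=True, likelihood="medium", severity="major", immediacy="immediate"),
]

default_score = harmfulness(tree)                      # harms and benefits weighted equally
steered_score = harmfulness(tree, benefit_scale=2.0)   # a community that values benefits more
```

Because every weight is an explicit, named number, a reader can trace why a given score was produced, and changing `benefit_scale` or `harm_scale` shifts the decision boundary without retraining anything. This is the sense in which such a score is steerable in this sketch.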

Summary (original)

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community’s values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured ‘harm-benefit tree,’ which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impact on any stakeholders. SafetyAnalyst then aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this conceptual framework to develop, test, and release an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On a comprehensive set of prompt safety benchmarks, we show that SafetyReporter (average F1 = 0.81) outperforms existing LLM safety moderation systems (average F1 < 0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.

arXiv information

Authors	Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine
Published	2025-01-31 18:01:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
