Output Scouting: Auditing Large Language Models for Catastrophic Responses

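Summary

Recent high-profile incidents in which the use of large language models (LLMs) caused significant harm to individuals have increased interest in AI safety.
One reason LLM safety issues arise is that models often have a non-zero probability of producing harmful outputs.
In this work, we explore the following scenario: imagine an AI safety auditor who is searching for catastrophic responses from an LLM (for example, a "yes" response to "can I fire an employee for being pregnant?") and who can query the model only a limited number of times.
What strategy for querying the model would find these failure responses efficiently?
To this end, we propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt that match a target probability distribution.
We then run experiments using two LLMs and find numerous examples of catastrophic responses.
We conclude with a discussion that includes advice for practitioners who want to implement LLM auditing for catastrophic responses.
We also release an open-source toolkit (https://github.com/joaopfonseca/outputscouting) that implements our auditing framework using the Hugging Face transformers library.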

Summary (original)

Recent high-profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about a growing interest in AI safety. One reason LLM safety issues occur is that models often have at least some non-zero probability of producing harmful outputs. In this work, we explore the following scenario: imagine an AI safety auditor is searching for catastrophic responses from an LLM (e.g. a ‘yes’ response to ‘can I fire an employee for being pregnant?’), and is able to query the model a limited number of times (e.g. 1000 times). What is a strategy for querying the model that would efficiently find those failure responses? To this end, we propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution. We then run experiments using two LLMs and find numerous examples of catastrophic responses. We conclude with a discussion that includes advice for practitioners who are looking to implement LLM auditing for catastrophic responses. We also release an open-source toolkit (https://github.com/joaopfonseca/outputscouting) that implements our auditing framework using the Hugging Face transformers library.
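
The query-budget constraint can be made concrete with a back-of-the-envelope calculation. The sketch below is not the paper's method; it only illustrates why plain repeated sampling may be inefficient. It assumes each query is an independent draw and that a catastrophic response occurs with some per-query probability p, so the chance of seeing at least one in n queries is 1 - (1 - p)^n. The function names and the values of p are hypothetical and chosen for illustration.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Probability of at least one failure in n independent samples (0 < p < 1)."""
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n * math.log1p(-p))


def queries_needed(p: float, target: float = 0.95) -> int:
    """Smallest n for which prob_at_least_one(p, n) reaches the target (0 < p < 1)."""
    # Solve (1 - p)^n <= 1 - target for n and round up.
    return math.ceil(math.log1p(-target) / math.log1p(-p))


if __name__ == "__main__":
    budget = 1000  # the example query budget mentioned in the abstract
    for p in (1e-2, 1e-3, 1e-4, 1e-5):  # hypothetical failure rates
        hit = prob_at_least_one(p, budget)
        need = queries_needed(p)
        print(f"p={p:g}: P(at least one hit in {budget}) = {hit:.4f}, "
              f"queries for 95% = {need}")
```

Under these assumptions, a failure that occurs once in a hundred thousand samples is unlikely to surface within a 1000-query budget, which is the kind of gap a targeted querying strategy is meant to address.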

arXiv information

Authors	Andrew Bell, Joao Fonseca
Published	2025-03-28 15:45:58+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
