CALM: Curiosity-Driven Auditing for Large Language Models

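Summary

Auditing large language models (LLMs) is a crucial and challenging task.
This study focuses on auditing black-box LLMs, where the auditor can access only the provided service and not the model parameters.
We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behavior.
For example, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response containing politically sensitive individuals.
This optimization is difficult because feasible points are scarce, the prompt space is discrete, and the search space is large.
To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to fine-tune an LLM as an auditor agent that uncovers potentially harmful and biased input-output pairs of the target LLM.
CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names in the black-box setting.
This work offers a promising direction for auditing black-box LLMs.
Our code is available at https://github.com/x-zheng16/CALM.git.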
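
The idea of combining a task reward with a curiosity term can be illustrated with a small sketch. This is not the authors' implementation: the count-based novelty bonus, the weight `beta`, and the names `novelty_bonus`, `audit_reward`, and `toy_target` are all assumptions made for illustration, and `toy_target` is only a hypothetical stand-in for a black-box LLM service. The sketch shows how an auditor's reward might stay informative even when flagged outputs (the "feasible points") are rare.

```python
import math
from collections import Counter


def novelty_bonus(visit_counts: Counter, key: str) -> float:
    """Count-based intrinsic reward (an assumption, not the paper's method).

    The bonus shrinks as the same key is observed repeatedly.
    """
    visit_counts[key] += 1
    return 1.0 / math.sqrt(visit_counts[key])


def audit_reward(output: str, flagged_terms: set, visit_counts: Counter,
                 beta: float = 0.5) -> float:
    """Extrinsic reward for a flagged output plus a weighted novelty bonus."""
    text = output.lower()
    # Extrinsic part: 1.0 if the black-box output contains any flagged term.
    extrinsic = 1.0 if any(term in text for term in flagged_terms) else 0.0
    # Intrinsic part: larger for outputs that have rarely been seen so far.
    intrinsic = novelty_bonus(visit_counts, text)
    return extrinsic + beta * intrinsic


def toy_target(prompt: str) -> str:
    """Hypothetical stand-in for a black-box LLM; only input/output is visible."""
    return "flagged reply" if "trigger" in prompt else "benign reply"


counts = Counter()
prompts = ["hello", "hello again", "a trigger"]
rewards = [audit_reward(toy_target(p), {"flagged"}, counts) for p in prompts]
```

In this toy setup, repeated benign outputs earn a shrinking bonus, while a prompt that elicits a flagged output earns both the extrinsic reward and a full novelty bonus. In a real audit the reward would drive a reinforcement learning update of the auditor LLM, which is omitted here.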

Abstract (original)

Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output or an input that induces the hallucinative response from the target LLM containing politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potential harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.

arXiv information

Authors	Xiang Zheng, Longxiang Wang, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang
Published	2025-01-06 13:14:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
