AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Abstract

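Robustness of LLMs to jailbreak attacks, in which users craft prompts to bypass safety measures and misuse model capabilities, has so far been studied mainly for LLMs that act as simple chatbots.
LLM agents, which use external tools and can carry out multi-stage tasks, may pose a greater risk when misused, yet their robustness remains underexplored.
To facilitate research on the misuse of LLM agents, we propose a new benchmark called AgentHarm.
The benchmark comprises a diverse set of 110 explicitly malicious agent tasks (440 with augmentations) covering 11 harm categories, including fraud, cybercrime, and harassment.
Beyond measuring whether a model refuses harmful agentic requests, scoring well on AgentHarm requires a jailbroken agent to retain its capabilities after the attack so that it can complete a multi-step task.
Evaluating a range of leading LLMs, we find that (1) leading LLMs are surprisingly compliant with malicious agent requests even without jailbreaking, (2) simple universal jailbreak templates can be adapted to jailbreak agents effectively, and (3) these jailbreaks enable coherent, malicious multi-step agent behavior while retaining model capabilities.
To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.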

Abstract (original)

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents — which use external tools and can execute multi-stage tasks — may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
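The abstract distinguishes two things a benchmark of this kind has to track: whether the agent refuses, and whether a non-refusing agent still completes the multi-step task. The following is a minimal sketch of that distinction only, not the benchmark's actual grader. The `TaskResult` record, its fields, the function names, and the example numbers are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical record layout)."""
    task_id: str
    refused: bool   # True if the agent declined the request
    score: float    # task-completion score in [0, 1]; 0.0 when nothing was done


def refusal_rate(results):
    """Fraction of tasks on which the agent refused."""
    return sum(r.refused for r in results) / len(results)


def mean_score(results):
    """Average completion score over all tasks, refusals included."""
    return sum(r.score for r in results) / len(results)


def mean_score_non_refused(results):
    """Average completion score over the tasks the agent did not refuse."""
    answered = [r for r in results if not r.refused]
    if not answered:
        return 0.0
    return sum(r.score for r in answered) / len(answered)


# Made-up example data: one refusal, three attempts of varying quality.
example = [
    TaskResult("task-a", refused=True, score=0.0),
    TaskResult("task-b", refused=False, score=0.5),
    TaskResult("task-c", refused=False, score=1.0),
    TaskResult("task-d", refused=False, score=0.0),
]
```

In this made-up example the agent refuses only one task in four, yet it fails outright on `task-d` despite not refusing. Reporting a completion score alongside the refusal rate is what separates a jailbreak that merely suppresses refusals from one that leaves the agent capable of finishing the task.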

arXiv information

Authors	Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
Published	2025-04-18 14:30:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
