HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

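Summary

Automated red teaming holds great promise for discovering and mitigating the risks associated with malicious use of large language models (LLMs), but the field lacks a standardized evaluation framework for rigorously assessing new methods.
To address this problem, we introduce HarmBench, a standardized evaluation framework for automated red teaming.
We identify several desirable properties that earlier red teaming evaluations did not account for, and we systematically design HarmBench to satisfy them.
Using HarmBench, we carry out a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, which yields new insights.
We also present a highly efficient adversarial training method that greatly strengthens LLM robustness across a wide range of attacks, showing how HarmBench enables the codevelopment of attacks and defenses.
We open-source HarmBench at https://github.com/centerforaisafety/HarmBench.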

Summary (original)

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.

arXiv information

Authors	Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
Published	2024-02-06 18:59:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
