Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models




Generative models are rapidly gaining popularity and being integrated into everyday applications, raising concerns over their safe use as various vulnerabilities are exposed. In light of this, the field of red teaming is undergoing fast-paced growth, highlighting the need for a comprehensive survey covering the entire pipeline and addressing emerging topics. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed the ‘searcher’ framework to unify various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around LLM-based agents, overkill of harmless queries, and the balance between harmlessness and helpfulness.


Authors: Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, Xudong Han, Haonan Li
Published: 2024-11-26 11:59:17+00:00


Category: cs.CL