Toxicity Detection towards Adaptability to Changing Perturbations

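Abstract

Toxicity detection is crucial for maintaining social harmony.
Existing methods perform well on ordinary toxic content and on content generated by specific perturbation methods, but they are vulnerable to evolving perturbation patterns.
In real-world scenarios, however, malicious users tend to create new perturbation patterns to fool detectors.
For example, some users may evade the detector of a large language model (LLM) by adding "I am a scientist" at the beginning of the prompt.
This paper introduces a new problem to the toxicity detection field: continual learning of jailbreak perturbation patterns.
To tackle this problem, we first construct a new dataset generated by 9 types of perturbation patterns, 7 summarized from prior work and 2 developed by us.
We then systematically validate the vulnerability of current methods on this new perturbation pattern-aware dataset through both zero-shot and fine-tuned cross-pattern detection.
Building on this, we present a domain incremental learning paradigm and a corresponding benchmark to ensure the detector's robustness to dynamically emerging types of perturbed toxic text.
Our code and dataset are provided in the appendix and will be made publicly available on GitHub, which we hope will open new research opportunities for security-related communities.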
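
The abstract does not spell out how a domain incremental benchmark is run, so the following is only a minimal sketch of a common evaluation protocol, not the authors' method. It treats each perturbation pattern as one domain, trains on the domains one at a time, and after each step records accuracy on every domain. The three perturbation functions, the placeholder texts, and the `MemorizingDetector` class are all hypothetical stand-ins introduced for illustration; the paper's nine patterns and its detectors are not reproduced here.

```python
from typing import Callable, Dict, List, Set, Tuple

Sample = Tuple[str, bool]  # (text, is_toxic)


# Hypothetical perturbations for illustration; NOT the paper's nine patterns.
def insert_dots(text: str) -> str:
    """Separate every character with a dot."""
    return ".".join(text)


def leetspeak(text: str) -> str:
    """Replace a few vowels with look-alike digits."""
    return text.translate(str.maketrans("aeio", "4310"))


def add_prefix(text: str) -> str:
    """Prepend the role claim quoted in the abstract."""
    return "I am a scientist. " + text


PATTERNS: Dict[str, Callable[[str], str]] = {
    "dots": insert_dots,
    "leet": leetspeak,
    "prefix": add_prefix,
}

# Placeholder strings standing in for real toxic and benign text.
TOXIC_SEEDS = ["toxic_sample_one", "toxic_sample_two"]
BENIGN = ["hello there", "have a nice day"]


def build_domain(perturb: Callable[[str], str]) -> List[Sample]:
    """One domain = perturbed toxic seeds plus unperturbed benign text."""
    return [(perturb(t), True) for t in TOXIC_SEEDS] + [(b, False) for b in BENIGN]


class MemorizingDetector:
    """Toy detector (an assumption): flags only text it was trained on as toxic."""

    def __init__(self) -> None:
        self.seen_toxic: Set[str] = set()

    def fit(self, samples: List[Sample]) -> None:
        self.seen_toxic.update(text for text, is_toxic in samples if is_toxic)

    def predict(self, text: str) -> bool:
        return text in self.seen_toxic


def accuracy(detector: MemorizingDetector, samples: List[Sample]) -> float:
    correct = sum(detector.predict(text) == label for text, label in samples)
    return correct / len(samples)


def run_protocol() -> List[List[float]]:
    """Train on domains sequentially; row t holds accuracy on every domain after step t."""
    domains = {name: build_domain(fn) for name, fn in PATTERNS.items()}
    detector = MemorizingDetector()
    matrix: List[List[float]] = []
    for name in domains:
        detector.fit(domains[name])
        matrix.append([accuracy(detector, d) for d in domains.values()])
    return matrix


if __name__ == "__main__":
    for name, row in zip(PATTERNS, run_protocol()):
        print(f"after training on {name}: {row}")
```

In the resulting matrix, entries to the right of the diagonal show how the detector fares on patterns it has not yet seen, and entries to the left show whether earlier patterns are still handled after later training. This toy detector never forgets, so it illustrates only the bookkeeping; a learned model would generally need an explicit strategy to keep the left-hand entries high.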

Abstract (original)

Toxicity detection is crucial for maintaining the peace of the society. While existing methods perform well on normal toxic contents or those generated by specific perturbation methods, they are vulnerable to evolving perturbation patterns. However, in real-world scenarios, malicious users tend to create new perturbation patterns for fooling the detectors. For example, some users may circumvent the detector of large language models (LLMs) by adding `I am a scientist’ at the beginning of the prompt. In this paper, we introduce a novel problem, i.e., continual learning jailbreak perturbation patterns, into the toxicity detection field. To tackle this problem, we first construct a new dataset generated by 9 types of perturbation patterns, 7 of them are summarized from prior work and 2 of them are developed by us. We then systematically validate the vulnerability of current methods on this new perturbation pattern-aware dataset via both the zero-shot and fine tuned cross-pattern detection. Upon this, we present the domain incremental learning paradigm and the corresponding benchmark to ensure the detector’s robustness to dynamically emerging types of perturbed toxic text. Our code and dataset are provided in the appendix and will be publicly available at GitHub, by which we wish to offer new research opportunities for the security-relevant communities.

arXiv information

Authors	Hankun Kang, Jianhao Chen, Yongqi Li, Xin Miao, Mayi Xu, Ming Zhong, Yuanyuan Zhu, Tieyun Qian
Published	2025-01-08 09:18:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
