TaeBench: Improving Quality of Toxic Adversarial Examples

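Summary

Toxicity text detectors can be vulnerable to adversarial examples: small perturbations to input text that fool the system into a wrong detection.
Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, which makes them less useful for evaluating or improving real-world toxicity content moderators.
This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE).
The authors design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE.
A successful TAE should fool a target toxicity model into making a benign prediction, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity.
Applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes reveals many invalid samples among a total of 940k raw TAE attack generations.
The proposed pipeline is then used to filter and curate a high-quality TAE dataset called TaeBench (of size 264k).
Empirically, TaeBench is shown to transfer-attack SOTA toxicity content moderation models and services effectively.
The experiments also show that adversarial training with TaeBench significantly improves the robustness of two toxicity detectors.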

Summary (original)

Toxicity text detectors can be vulnerable to adversarial examples – small perturbations to input text that fool the systems into wrong detection. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. Successful TAE should fool a target toxicity model into making benign predictions, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples from a total of 940k raw TAE attack generations. We then utilize the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that TaeBench with adversarial training achieves significant improvements in the robustness of two toxicity detectors.
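
The four quality requirements listed in the abstract can be read as a conjunctive filter over raw attack generations. The following is a minimal Python sketch under that reading, assuming each check has already been reduced to a boolean label. The `Candidate` fields and the function names are hypothetical illustrations, not the authors' pipeline, which combines model-based annotation with human verification.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    """One raw attack generation with hypothetical boolean quality labels."""
    text: str
    target_predicts_toxic: bool  # target detector's prediction on the perturbed text
    grammatical: bool            # judged grammatically reasonable
    natural: bool                # judged to read like human-written text
    semantically_toxic: bool     # judged to still carry toxic meaning


def is_valid_tae(c: Candidate) -> bool:
    # A candidate is kept only if it fools the target model (benign prediction)
    # AND passes all three text-quality checks.
    return (
        not c.target_predicts_toxic
        and c.grammatical
        and c.natural
        and c.semantically_toxic
    )


def filter_taes(candidates: Iterable[Candidate]) -> List[Candidate]:
    # Keep only the candidates that satisfy every requirement.
    return [c for c in candidates if is_valid_tae(c)]


if __name__ == "__main__":
    raw = [
        Candidate("example A", False, True, True, True),   # valid
        Candidate("example B", True, True, True, True),    # did not fool the model
        Candidate("example C", False, True, True, False),  # toxicity was lost
    ]
    kept = filter_taes(raw)
    print([c.text for c in kept])
```

In this sketch a single failed check discards the sample, which matches the abstract's description of invalid or ambiguous generations being filtered out; how each label is actually produced is left open here.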

arXiv information

Authors	Xuan Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, Yanjun Qi
Published	2025-05-01 02:59:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
