NoisyHate: Mining Online Human-Written Perturbations for Realistic Robustness Benchmarking of Content Moderation Models

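Summary

Online text with toxic content is a clear threat to social media users in particular and to society in general.
Many platforms have adopted measures such as machine learning-based hate-speech detection systems to reduce its effect, but writers of toxic content try to evade these measures with cleverly modified toxic words, known as human-written text perturbations.
To help build automatic detection tools that recognize such perturbations, prior work has developed sophisticated techniques for generating diverse adversarial samples.
However, these algorithm-generated perturbations do not necessarily capture all the traits of human-written ones.
This paper therefore introduces NoisyHate, a new, high-quality dataset of human-written perturbations built from real-life perturbations that were both written and verified by humans in the loop.
Perturbations in NoisyHate have different characteristics from those in prior algorithm-generated toxic datasets, which makes them particularly useful for developing better toxic speech detection solutions.
NoisyHate is thoroughly validated against state-of-the-art language models such as BERT and RoBERTa and black-box APIs such as Perspective API on two tasks: perturbation normalization and perturbation understanding.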

Summary (original)

Online texts with toxic content are a clear threat to the users on social media in particular and society in general. Although many platforms have adopted various measures (e.g., machine learning-based hate-speech detection systems) to diminish their effect, toxic content writers have also attempted to evade such measures by using cleverly modified toxic words, so-called human-written text perturbations. Therefore, to help build automatic detection tools to recognize those perturbations, prior methods have developed sophisticated techniques to generate diverse adversarial samples. However, we note that these “algorithms”-generated perturbations do not necessarily capture all the traits of “human”-written perturbations. Therefore, in this paper, we introduce a novel, high-quality dataset of human-written perturbations, named as NoisyHate, that was created from real-life perturbations that are both written and verified by human-in-the-loop. We show that perturbations in NoisyHate have different characteristics than prior algorithm-generated toxic datasets show, and thus can be in particular useful to help develop better toxic speech detection solutions. We thoroughly validate NoisyHate against state-of-the-art language models, such as BERT and RoBERTa, and black box APIs, such as Perspective API, on two tasks, such as perturbation normalization and understanding.

arXiv information

Authors	Yiran Ye, Thai Le, Dongwon Lee
Published	2025-04-28 15:25:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
