Testing Hateful Speeches against Policies

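Abstract

In recent years, many software systems have adopted AI techniques, especially deep learning.
Because of their black-box nature, AI-based systems pose challenges for traceability: their behavior is determined by models and data, whereas requirements and policies are rules expressed in natural language or a programming language.
To the best of our knowledge, few studies have examined how AI and deep neural network-based systems behave against rule-based requirements and policies.
This experience paper examines the behavior of deep neural networks against rule-based requirements described in natural language policies.
In particular, we focus on a case study that checks AI-based content moderation software against content moderation policies.
First, using crowdsourcing, we collect natural language test cases that match each moderation policy and name this dataset HateModerate.
Second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software and find that these models have high failure rates for certain policies.
Finally, since manual labeling is costly, we further propose an automated approach that augments HateModerate by fine-tuning OpenAI's large language models to automatically match new examples to policies.
The dataset and code for this work are available on our anonymous website: \url{https://sites.google.com/view/content-moderation-project}.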
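The per-policy failure rate described above can be illustrated with a minimal Python sketch. It is not the authors' code: the tuple layout `(policy_id, text, expected_label)`, the function `failure_rates_by_policy`, and the keyword-based `toy_classifier` are all hypothetical stand-ins, and the test sentences are neutral placeholders rather than HateModerate data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical test-case layout: (policy_id, text, expected_label),
# where expected_label is True if the text should be flagged as hateful.
TestCase = Tuple[str, str, bool]


def failure_rates_by_policy(
    test_cases: Iterable[TestCase],
    classify: Callable[[str], bool],
) -> Dict[str, float]:
    """Return, for each policy, the fraction of test cases the classifier gets wrong."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for policy_id, text, expected in test_cases:
        totals[policy_id] += 1
        if classify(text) != expected:
            failures[policy_id] += 1
    return {policy: failures[policy] / totals[policy] for policy in totals}


def toy_classifier(text: str) -> bool:
    """Stand-in for a real detector (assumption): flags text containing a marker token."""
    return "BADWORD" in text


# Placeholder test cases, not taken from the HateModerate dataset.
cases = [
    ("policy_1", "BADWORD aimed at group A", True),
    ("policy_1", "implicit attack on group A", True),  # missed by the toy classifier
    ("policy_2", "BADWORD aimed at group B", True),
    ("policy_2", "a neutral sentence", False),
]

rates = failure_rates_by_policy(cases, toy_classifier)
for policy, rate in sorted(rates.items()):
    print(f"{policy}: failure rate {rate:.2f}")
```

Grouping results by policy rather than reporting one aggregate number is what makes it possible to see that a model fails on certain policies while doing well on others.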

Abstract (original)

In recent years, many software systems have adopted AI techniques, especially deep learning techniques. Due to their black-box nature, AI-based systems have brought challenges to traceability, because AI system behaviors are based on models and data, whereas the requirements or policies are rules in the form of natural or programming language. To the best of our knowledge, there is a limited number of studies on how AI and deep neural network-based systems behave against rule-based requirements/policies. This experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. In particular, we focus on a case study to check AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases which match each moderation policy, and we name this dataset HateModerate; second, using the test cases in HateModerate, we test the failure rates of state-of-the-art hate speech detection software, and we find that these models have high failure rates for certain policies; finally, since manual labeling is costly, we further propose an automated approach to augment HateModerate by fine-tuning OpenAI's large language models to automatically match new examples to policies. The dataset and code of this work can be found on our anonymous website: \url{https://sites.google.com/view/content-moderation-project}.

arXiv information

Authors	Jiangrui Zheng, Xueqing Liu, Girish Budhrani, Wei Yang, Ravishka Rathnasuriya
Published	2023-07-23 20:08:38+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
