Towards Conceptualization of ‘Fair Explanation’: Disparate Impacts of anti-Asian Hate Speech Explanations on Content Moderators

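Summary

Recent research at the intersection of AI explainability and fairness has focused on how explanations can improve human-plus-AI task performance, as assessed by fairness measures.
We propose to characterize what makes an explanation itself ‘fair’, that is, an explanation that does not adversely affect specific populations.
We formulate a new evaluation method for ‘fair explanations’ that uses not only accuracy and label time but also the psychological impact of explanations on different user groups, measured across several metrics (mental discomfort, stereotype activation, and perceived workload).
We apply this method to content moderation of potential hate speech, examining how two explanation approaches (saliency maps and counterfactual explanations) differ in their impact on Asian versus non-Asian proxy moderators.
We find that saliency maps generally perform better and show less evidence of disparate impact (group) and individual unfairness than counterfactual explanations.
Content warning: This paper contains examples of hate speech and racially discriminatory language.
The authors do not support such content.
Please consider your risk of discomfort carefully before reading on.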

Abstract (original)

Recent research at the intersection of AI explainability and fairness has focused on how explanations can improve human-plus-AI task performance as assessed by fairness measures. We propose to characterize what constitutes an explanation that is itself ‘fair’ — an explanation that does not adversely impact specific populations. We formulate a novel evaluation method of ‘fair explanations’ using not just accuracy and label time, but also psychological impact of explanations on different user groups across many metrics (mental discomfort, stereotype activation, and perceived workload). We apply this method in the context of content moderation of potential hate speech, and its differential impact on Asian vs. non-Asian proxy moderators, across explanation approaches (saliency map and counterfactual explanation). We find that saliency maps generally perform better and show less evidence of disparate impact (group) and individual unfairness than counterfactual explanations. Content warning: This paper contains examples of hate speech and racially discriminatory language. The authors do not support such content. Please consider your risk of discomfort carefully before continuing reading!

arXiv information

Authors	Tin Nguyen, Jiannan Xu, Aayushi Roy, Hal Daumé III, Marine Carpuat
Published	2023-10-23 15:57:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
