Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments

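Summary

As online platforms grow, their comment sections increasingly host harassment that harms users' experience and well-being.
This study benchmarks three leading large language models (OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus) on a corpus of 5,080 YouTube comments sampled from high-abuse threads on gaming, lifestyle, food vlog, and music channels.
The dataset consists of 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83).
Under a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance, with an F1 score of 0.863, precision of 0.887, and recall of 0.841.
Gemini flagged the largest share of harmful posts (recall = 0.875), but frequent false positives lowered its precision to 0.767.
Claude delivered the highest precision (0.920) and the lowest false-positive rate (0.022), but its recall dropped to 0.720.
Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang.
These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and are fine-tuned for under-represented languages and implicit abuse.
A de-identified version of the dataset and the full prompts are publicly released to promote reproducibility and further progress in automated content moderation.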
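The summary does not say how complementary models would be combined. As a minimal sketch only, the snippet below shows three simple voting rules over per-model harmful/non-harmful verdicts. The function name `combine_votes`, the rule names, and the example verdicts are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def combine_votes(votes: Sequence[bool], rule: str = "majority") -> bool:
    """Combine per-model 'harmful' verdicts into one decision.

    Hypothetical helper for illustration; the paper does not specify a rule.
    - "any":      flag if at least one model flags (favors recall)
    - "all":      flag only if every model flags (favors precision)
    - "majority": flag if more than half of the models flag
    """
    if rule == "any":
        return any(votes)
    if rule == "all":
        return all(votes)
    if rule == "majority":
        return sum(votes) > len(votes) / 2
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical verdicts for one comment from three models.
example_votes = [True, False, False]
for rule in ("any", "all", "majority"):
    print(rule, combine_votes(example_votes, rule))
```

An "any" rule leans toward a high-recall model's behavior and an "all" rule toward a high-precision model's behavior; which trade-off is appropriate depends on the moderation setting.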

Summary (original)

As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen’s kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.
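The reported metrics can be cross-checked with a little arithmetic. The sketch below recomputes F1 from the stated precision and recall, and estimates each model's false-positive rate from precision, recall, and the class counts (1,334 harmful, 3,746 non-harmful). The helper names are illustrative, and because the inputs are rounded to three decimals, the derived values are approximate. Only GPT-4.1's F1 and Claude's false-positive rate are stated in the abstract; the other derived figures are not reported there.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(precision: float, recall: float,
                        n_pos: int, n_neg: int) -> float:
    """Estimate FPR = FP / N_neg from precision, recall, and class counts."""
    tp = recall * n_pos                    # true positives implied by recall
    fp = tp * (1 - precision) / precision  # from precision = TP / (TP + FP)
    return fp / n_neg


# (precision, recall) as reported in the abstract.
reported = {
    "GPT-4.1": (0.887, 0.841),
    "Gemini 1.5 Pro": (0.767, 0.875),
    "Claude 3 Opus": (0.920, 0.720),
}
N_HARMFUL, N_NON_HARMFUL = 1334, 3746

for model, (p, r) in reported.items():
    f1 = f1_score(p, r)
    fpr = false_positive_rate(p, r, N_HARMFUL, N_NON_HARMFUL)
    print(f"{model}: F1={f1:.3f}, estimated FPR={fpr:.3f}")
```

For GPT-4.1 this gives an F1 of 0.863, and for Claude an estimated false-positive rate of 0.022, both matching the figures in the abstract.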

arXiv information

Author	Amel Muminovic
Published	2025-05-28 16:18:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
