Probing LLMs for hate speech detection: strengths and vulnerabilities

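Abstract

Recently, both social media platforms and researchers have been working to detect hateful or toxic language using large language models.
However, none of these works aims to use explanations, additional context, or victim community information in the detection process.
We use different prompt variations and input information, and evaluate large language models in a zero-shot setting (without adding any in-context examples).
We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets (HateXplain, implicit hate and ToxicSpans).
We find that, on average, including the target information in the pipeline improves model performance substantially (~20-30%) over the baseline across the datasets.
Adding rationales/explanations to the pipeline also has a considerable effect (~10-20%) over the baseline across the datasets.
In addition, we provide a typology of the error cases in which these large language models fail to (i) classify and (ii) explain the reason for the decisions they take.
Such vulnerable points automatically constitute "jailbreak" prompts for these models, and industry-scale safeguard techniques need to be developed to make the models robust against such prompts.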
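
The zero-shot prompt variations described above (baseline, with target information, with a rationale) might be assembled roughly as follows. This is only an illustrative sketch: the function name `build_prompt`, the label set, and the prompt wording are assumptions made for this example and are not the prompts used in the paper.

```python
from typing import Optional

# Hypothetical label set; the datasets named above use their own label schemes.
LABELS = ("hate", "not hate")


def build_prompt(post: str, target: Optional[str] = None,
                 rationale: Optional[str] = None) -> str:
    """Assemble a zero-shot classification prompt (no in-context examples)."""
    parts = [
        "Classify the post below as one of: " + ", ".join(LABELS) + ".",
        "Post: " + post,
    ]
    # Optional extra input: the community the post is directed at.
    if target is not None:
        parts.append("Targeted community: " + target)
    # Optional extra input: a rationale/explanation for the post.
    if rationale is not None:
        parts.append("Rationale: " + rationale)
    parts.append("Answer:")
    return "\n".join(parts)


# Three prompt variants for the same (placeholder) post.
baseline = build_prompt("example post")
with_target = build_prompt("example post", target="example community")
with_both = build_prompt("example post", target="example community",
                         rationale="example rationale")
```

Each variant would then be sent to the model under evaluation, and the predicted labels compared against the baseline prompt to measure the effect of the added information.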

Abstract (original)

Recently efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. However, none of these works aim to use explanation, additional context and victim community information in the detection process. We utilise different prompt variation, input information and evaluate large language models in zero shot setting (without adding any in-context examples). We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets – HateXplain, implicit hate and ToxicSpans. We find that on average including the target information in the pipeline improves the model performance substantially (~20-30%) over the baseline across the datasets. There is also a considerable effect of adding the rationales/explanations into the pipeline (~10-20%) over the baseline across the datasets. In addition, we further provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reason for the decisions they take. Such vulnerable points automatically constitute ‘jailbreak’ prompts for these models and industry scale safeguard techniques need to be developed to make the models robust against such prompts.

arXiv information

Authors	Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee, Punyajoy Saha
Published	2023-10-19 16:11:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
