Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with RLAIF

Summary

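Counterspeech, defined as a response that mitigates online hate speech, is increasingly used as a solution that does not rely on censorship.
Addressing hate speech effectively requires dispelling the stereotypes, prejudices, and biases that are often subtly implied in short, single-sentence statements or insults.
These implicit expressions pose a challenge for language models, particularly in seq2seq tasks, because models typically perform better with longer contexts.
Our work introduces CoARL, a new framework that improves counterspeech generation by modeling the pragmatic implications underlying the social biases in hateful statements.
The first two phases of CoARL involve sequential multi-instruction tuning: the model first learns to understand the intents, reactions, and harms of offensive statements, and then learns task-specific low-rank adapter weights for generating intent-conditioned counterspeech.
The final phase uses reinforcement learning to fine-tune the outputs for effectiveness and non-toxicity.
CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, with average improvements of 3 points in intent conformity and 4 points in argument-quality metrics.
Extensive human evaluation confirms that CoARL generates better, more context-appropriate responses than existing systems, including prominent LLMs such as ChatGPT.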
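The second phase mentions learning "task-specific low-rank adapter weights". The summary does not say which layers are adapted or which rank and scaling are used, so the sketch below shows only the generic low-rank adapter (LoRA) idea under those unknowns: a frozen weight W0 plus a trainable low-rank update B·A scaled by alpha/r. All names, shapes, and numbers here are illustrative assumptions, not CoARL's code.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def lora_forward(x: Vector, w0: Matrix, a: Matrix, b: Matrix, alpha: float) -> Vector:
    """Generic LoRA forward pass: h = W0 x + (alpha / r) * B (A x).

    w0: frozen pretrained weight, shape (d, k)
    a:  trainable down-projection, shape (r, k)
    b:  trainable up-projection, shape (d, r); usually initialized to zeros
    """
    r = len(a)                        # adapter rank = number of rows in A
    base = matvec(w0, x)              # frozen path, never updated during tuning
    delta = matvec(b, matvec(a, x))   # low-rank path, the only trained weights
    scale = alpha / r
    return [h + scale * dh for h, dh in zip(base, delta)]


def lora_param_count(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d x k weight."""
    return d * k, r * (d + k)


if __name__ == "__main__":
    x = [2.0, 3.0]
    w0 = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 1.0]]          # rank r = 1
    b_init = [[0.0], [0.0]]   # zero init: the adapter starts as a no-op
    print(lora_forward(x, w0, a, b_init, alpha=1.0))
    print(lora_forward(x, w0, a, [[1.0], [0.0]], alpha=1.0))
    print(lora_param_count(4096, 4096, 8))
```

With the hypothetical sizes d = k = 4096 and r = 8, one adapted matrix trains 65,536 parameters instead of 16,777,216 (1/256 of the full weight), which is why a separate task-specific adapter can be kept per task while the base model stays shared.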

Summary (original)

Counterspeech, defined as a response to mitigate online hate speech, is increasingly used as a non-censorial solution. Addressing hate speech effectively involves dispelling the stereotypes, prejudices, and biases often subtly implied in brief, single-sentence statements or abuses. These implicit expressions challenge language models, especially in seq2seq tasks, as model performance typically excels with longer contexts. Our study introduces CoARL, a novel framework enhancing counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. CoARL’s first two phases involve sequential multi-instruction tuning, teaching the model to understand intents, reactions, and harms of offensive statements, and then learning task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and non-toxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, showing an average improvement of 3 points in intent-conformity and 4 points in argument-quality metrics. Extensive human evaluation supports CoARL’s efficacy in generating superior and more context-appropriate responses compared to existing systems, including prominent LLMs like ChatGPT.
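The final phase is described only as reinforcement learning that tunes outputs "for effectiveness and non-toxicity", with the title indicating AI feedback (RLAIF). How those signals are scored and combined is not stated here, so the following is a hypothetical sketch: a weighted difference of an effectiveness score and a toxicity score (both assumed to lie in [0, 1] and to come from AI-feedback models), plus the KL-style penalty against a frozen reference model that is common in RLHF-style PPO pipelines. The function names, weights, and `beta` are assumptions.

```python
def combined_reward(effectiveness: float, toxicity: float,
                    w_eff: float = 1.0, w_tox: float = 1.0) -> float:
    """Hypothetical scalar reward for one generated counterspeech.

    effectiveness, toxicity: scores in [0, 1], assumed to come from
    AI-feedback models (e.g. a judge LLM or classifiers); higher
    effectiveness is rewarded, higher toxicity is penalized.
    """
    if not (0.0 <= effectiveness <= 1.0 and 0.0 <= toxicity <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return w_eff * effectiveness - w_tox * toxicity


def kl_shaped_reward(reward: float, logprob_policy: float,
                     logprob_ref: float, beta: float = 0.1) -> float:
    """Common RLHF-style shaping: penalize drift from the reference model.

    logprob_policy / logprob_ref: log-probability of the same response under
    the model being tuned and under the frozen pre-RL model.
    """
    return reward - beta * (logprob_policy - logprob_ref)


if __name__ == "__main__":
    # Illustrative (effectiveness, toxicity) scores; not data from the paper.
    candidates = {
        "polite, evidence-based reply": (0.8, 0.05),
        "sarcastic reply": (0.6, 0.7),
    }
    for text, (eff, tox) in candidates.items():
        r = combined_reward(eff, tox)
        shaped = kl_shaped_reward(r, logprob_policy=-12.0, logprob_ref=-12.5)
        print(f"{text}: reward={r:.2f}, shaped={shaped:.2f}")
```

The KL term keeps the tuned model from drifting toward degenerate text that merely games the reward models; a linear combination is only one plausible choice, and a product or constrained formulation would serve the same two goals.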

arXiv information

Authors	Amey Hengle,Aswini Kumar,Sahajpreet Singh,Anil Bandhakavi,Md Shad Akhtar,Tanmoy Chakroborty
Published	2024-03-15 08:03:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
