Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

Summary

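Bias in LLMs can harm user experience and societal outcomes.
However, current bias mitigation methods often require intensive human feedback, transfer poorly to other topics, or produce overconfident and random outputs.
We find that involving LLMs in role-playing scenarios improves their ability to recognize and mitigate bias.
Building on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a new bias mitigation approach that replaces the human feedback used in traditional RLHF.
We use LLMs in multi-role debates to create a dataset containing both high-bias and low-bias instances, which is used to train the reward model for reinforcement learning.
Our approach has two modes: (1) self-reflection, in which the same LLM takes part in the multi-role debate, and (2) teacher-student, in which a more advanced LLM such as GPT-3.5-turbo guides the LLM through the task.
Experimental results across different LLMs on BBQ and our own datasets demonstrate the effectiveness of our approach for bias mitigation.
Our source code and datasets are available at \texttt{https://anonymous.4open.science/r/RLDF-E344}.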
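
The summary says that debate outputs are collected into high-bias and low-bias instances used to train a reward model, but it does not say how the instances are scored or which loss is used. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `DebateInstance` type, the `bias_score` field, the `threshold` value, and the pairwise (Bradley-Terry style) loss commonly used for RLHF reward models are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DebateInstance:
    """One utterance from a multi-role debate (hypothetical structure)."""
    text: str
    # Assumed scale: 0.0 = low bias, 1.0 = high bias, e.g. assigned by a referee role.
    bias_score: float


def build_preference_pairs(instances, threshold=0.5):
    """Pair every low-bias text (preferred) with every high-bias text (rejected)."""
    low = [i for i in instances if i.bias_score < threshold]
    high = [i for i in instances if i.bias_score >= threshold]
    return [(lo.text, hi.text) for lo in low for hi in high]


def pairwise_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

In this sketch, the loss shrinks as the reward model scores the low-bias text above the high-bias text, which is the signal a policy would later be optimized against during reinforcement learning.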

Summary (original)

Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics or yield overconfident and random outputs. We find that involving LLMs in role-playing scenario boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM like GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs on BBQ and our datasets demonstrate the effectiveness of our approach in bias mitigation. Our source code and datasets are available at \texttt{https://anonymous.4open.science/r/RLDF-E344}.

arXiv information

Authors	Ruoxi Cheng, Haoxuan Ma, Shuirong Cao, Jiaqi Li, Aihua Pei, Zhiqiang Wang, Pengliang Ji, Haoyu Wang, Jiaqi Huo
Published	2024-08-16 12:20:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
