Investigating Bias Representations in Llama 2 Chat via Activation Steering

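Summary

We address the challenge of societal bias in large language models (LLMs), focusing on the Llama 2 7B Chat model.
As LLMs are increasingly built into decision-making processes with substantial societal impact, it is essential to ensure that these models do not reinforce existing biases.
Our approach uses activation steering to probe for and mitigate biases related to gender, race, and religion.
This method manipulates the model's activations to steer responses toward or away from biased outputs, using steering vectors derived from the StereoSet dataset and from custom GPT4-generated gender bias prompts.
Our findings show that Llama 2 7B Chat has an inherent gender bias that persists even after Reinforcement Learning from Human Feedback (RLHF).
We also observe a predictable negative correlation between bias and the model's tendency to refuse to respond.
Importantly, our study finds that RLHF tends to increase the similarity between the model's representations of different forms of societal bias, which raises questions about how nuanced the model's understanding of those forms is.
This work also offers insights into effective red-teaming strategies for LLMs using activation steering, and in particular highlights the importance of integrating a refusal vector.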
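
The steering and similarity ideas described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a steering vector is the difference between the mean activations on biased and on neutral prompts (a common construction that the summary itself does not specify), and every function name and number below is hypothetical toy data rather than real Llama 2 activations.

```python
import math


def mean_vector(rows):
    """Average a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(biased_acts, neutral_acts):
    """Assumed construction: mean(biased) minus mean(neutral)."""
    mean_biased = mean_vector(biased_acts)
    mean_neutral = mean_vector(neutral_acts)
    return [b - n for b, n in zip(mean_biased, mean_neutral)]


def steer(activation, vector, multiplier):
    """Add a scaled steering vector to one activation vector.

    A positive multiplier pushes toward the biased direction,
    a negative multiplier pushes away from it.
    """
    return [a + multiplier * v for a, v in zip(activation, vector)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "activations" (hypothetical values).
gender_biased = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
gender_neutral = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
race_biased = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
race_neutral = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

gender_vec = steering_vector(gender_biased, gender_neutral)
race_vec = steering_vector(race_biased, race_neutral)

# Steer one activation away from the gender-bias direction.
steered = steer([1.0, 1.0, 1.0], gender_vec, -0.5)

# Compare how similar the two bias directions are.
similarity = cosine_similarity(gender_vec, race_vec)
```

In a real model the activations would be read from, and written back to, a chosen layer during the forward pass. A higher cosine similarity between two bias vectors would correspond to the kind of increased representational similarity the summary attributes to RLHF.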

Summary (original)

We address the challenge of societal bias in Large Language Models (LLMs), focusing on the Llama 2 7B Chat model. As LLMs are increasingly integrated into decision-making processes with substantial societal impact, it becomes imperative to ensure these models do not reinforce existing biases. Our approach employs activation steering to probe for and mitigate biases related to gender, race, and religion. This method manipulates model activations to direct responses towards or away from biased outputs, utilizing steering vectors derived from the StereoSet dataset and custom GPT4 generated gender bias prompts. Our findings reveal inherent gender bias in Llama 2 7B Chat, persisting even after Reinforcement Learning from Human Feedback (RLHF). We also observe a predictable negative correlation between bias and the model’s tendency to refuse responses. Significantly, our study uncovers that RLHF tends to increase the similarity in the model’s representation of different forms of societal biases, which raises questions about the model’s nuanced understanding of different forms of bias. This work also provides valuable insights into effective red-teaming strategies for LLMs using activation steering, particularly emphasizing the importance of integrating a refusal vector.

arXiv information

Authors	Dawn Lu, Nina Rimsky
Published	2024-02-01 07:48:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
