Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing

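Summary

Large language models (LLMs) often exhibit gender bias, which poses challenges for their safe deployment.
Existing bias mitigation methods either lack a comprehensive understanding of the underlying mechanisms or compromise the model's core capabilities.
To address these problems, we propose the CommonWords dataset for systematically evaluating gender bias in LLMs.
Our analysis reveals bias that is pervasive across models and identifies the specific neuron circuits responsible for this behavior, including gender neurons and general neurons.
Notably, editing even a small number of general neurons can disrupt the model's overall capabilities because of hierarchical neuron interactions.
Building on these insights, we propose an interpretable neuron editing method that combines logit-based and causal-based strategies to selectively target biased neurons.
Experiments on five LLMs show that our method effectively reduces gender bias while preserving the model's original capabilities, outperforming existing fine-tuning and editing approaches.
Our contributions are a new dataset, a detailed analysis of bias mechanisms, and a practical solution for mitigating gender bias in LLMs.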

Abstract (original)

Large language models (LLMs) often exhibit gender bias, posing challenges for their safe deployment. Existing methods to mitigate bias lack a comprehensive understanding of its mechanisms or compromise the model’s core capabilities. To address these issues, we propose the CommonWords dataset, to systematically evaluate gender bias in LLMs. Our analysis reveals pervasive bias across models and identifies specific neuron circuits, including gender neurons and general neurons, responsible for this behavior. Notably, editing even a small number of general neurons can disrupt the model’s overall capabilities due to hierarchical neuron interactions. Based on these insights, we propose an interpretable neuron editing method that combines logit-based and causal-based strategies to selectively target biased neurons. Experiments on five LLMs demonstrate that our method effectively reduces gender bias while preserving the model’s original capabilities, outperforming existing fine-tuning and editing approaches. Our findings contribute a novel dataset, a detailed analysis of bias mechanisms, and a practical solution for mitigating gender bias in LLMs.

arXiv information

Authors	Zeping Yu, Sophia Ananiadou
Published	2025-01-24 12:41:30+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
