BiasEdit: Debiasing Stereotyped Language Models via Model Editing

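Summary

Previous work has established that language models exhibit stereotyped biases.
Existing debiasing strategies, such as retraining a model on counterfactual data, representation projection, and prompting, often fail to eliminate bias efficiently or to directly alter the model's biased internal representations.
To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models through lightweight networks that act as editors and generate parameter updates.
BiasEdit uses a debiasing loss to guide the editor networks in making local edits to a subset of a language model's parameters, while a retention loss preserves the model's language modeling abilities during editing.
Experiments on StereoSet and CrowS-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared with tangential debiasing baselines, with little to no impact on the language models' general capabilities.
In addition, we conduct bias tracing to probe bias in various modules and explore how bias editing affects different components of language models.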
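
The abstract names two training signals, a debiasing loss and a retention loss, without giving their formulas. The sketch below is only one plausible way to combine such terms, written as a toy in plain Python. The function names, the squared log-probability gap, the KL-based retention term, and the weight `lam` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def debiasing_loss(logp_stereo, logp_anti):
    """Hypothetical debiasing term (an assumption, not the paper's formula).

    Squared gap between the log-probabilities the edited model assigns to a
    stereotyped and an anti-stereotyped context. It is zero when the model
    treats both contexts as equally likely.
    """
    return (logp_stereo - logp_anti) ** 2


def retention_loss(p_original, p_edited):
    """Hypothetical retention term (an assumption, not the paper's formula).

    KL divergence between the pre-edit and post-edit next-token
    distributions on text unrelated to the bias.
    """
    return kl_divergence(p_original, p_edited)


def editing_objective(logp_stereo, logp_anti, p_original, p_edited, lam=1.0):
    """Weighted sum of both terms; `lam` is an assumed trade-off weight."""
    return debiasing_loss(logp_stereo, logp_anti) + lam * retention_loss(
        p_original, p_edited
    )


# Made-up numbers: a model that prefers the stereotyped context is
# penalised more than one that scores both contexts equally.
biased = editing_objective(math.log(0.6), math.log(0.2), [0.7, 0.3], [0.7, 0.3])
balanced = editing_objective(math.log(0.4), math.log(0.4), [0.7, 0.3], [0.7, 0.3])
print(biased > balanced)
```

In a setup of this kind, the editor networks would be trained to produce parameter updates that lower the combined objective: the first term pushes the edited model toward treating both contexts alike, and the second discourages drift on unrelated text.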

Abstract (original)

Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or directly alter the models’ biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method to remove stereotypical bias from language models through lightweight networks that act as editors to generate parameter updates. BiasEdit employs a debiasing loss guiding editor networks to conduct local edits on partial parameters of a language model for debiasing while preserving the language modeling abilities during editing through a retention loss. Experiments on StereoSet and CrowS-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models’ general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore bias editing impacts on different components of language models.
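
The abstract does not say which metric the experiments report. As background, benchmarks such as StereoSet are commonly summarised with a stereotype score: the percentage of examples in which a model prefers the stereotyped option over the anti-stereotyped one, with 50% as the unbiased ideal. The sketch below computes that percentage for a list of score pairs. The function name and the example log-likelihoods are made up for illustration.

```python
def stereotype_score(pairs):
    """Percentage of (stereo, anti) score pairs where stereo is preferred.

    Each pair holds the model's score (for example a log-likelihood) for
    the stereotyped option and for the anti-stereotyped option. A value
    near 50 means the model shows no systematic preference.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


# Illustrative, made-up log-likelihoods for four examples.
example_pairs = [(-10.2, -11.0), (-8.5, -8.1), (-12.0, -12.4), (-9.3, -9.9)]
print(stereotype_score(example_pairs))  # 75.0
```

A debiasing method would aim to move this value toward 50 rather than toward 0, since a score well below 50 indicates a systematic preference for the anti-stereotyped option.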

arXiv information

Authors	Xin Xu,Wei Xu,Ningyu Zhang,Julian McAuley
Published	2025-03-11 16:25:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
