What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations

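Summary

Do large language models (LLMs) show sociodemographic biases even when they refuse to respond?
To get around their refusal to "speak," we probe contextualized embeddings and examine whether this bias is encoded in the models' latent representations.
We propose a logistic Bradley-Terry probe that predicts an LLM's word pair preferences from the words' hidden vectors.
We first validate the probe on three pair preference tasks and 13 LLMs, where it outperforms the word embedding association test (WEAT), a standard approach to testing for implicit association, by a relative 27% in error rate.
We also find that word pair preferences are best represented in the middle layers.
Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (comparing ethnicities) to examine biases in nationality, politics, religion, and gender.
We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer.
This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings.
Our codebase is at https://github.com/castorini/biasprobe.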

Abstract (original)

Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? To bypass their refusal to ‘speak,’ we study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words’ hidden vectors. We first validate our probe on three pair preference tasks and thirteen LLMs, where we outperform the word embedding association test (WEAT), a standard approach in testing for implicit association, by a relative 27% in error rate. We also find that word pair preferences are best represented in the middle layers. Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (compare ethnicities) to examine biases in nationality, politics, religion, and gender. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer. This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.
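The abstract does not spell out the probe's formula, so the following is only a minimal sketch of one common reading of a logistic Bradley-Terry model: a linear scoring vector w is fitted so that sigmoid(w · (h_a - h_b)) approximates the probability that item a is preferred over item b. The function names, the plain gradient-descent training loop, and the synthetic vectors standing in for LLM hidden states are all assumptions made for illustration; they are not taken from the authors' codebase.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_bt_probe(h_a, h_b, a_preferred, lr=0.5, steps=500):
    """Fit w so that sigmoid(w . (h_a - h_b)) approximates P(a preferred over b).

    h_a, h_b: arrays of shape (n_pairs, dim), one hidden vector per item.
    a_preferred: array of shape (n_pairs,), 1.0 if a is preferred, else 0.0.
    Hypothetical helper: plain gradient descent on the logistic loss.
    """
    diff = h_a - h_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = sigmoid(diff @ w)
        grad = diff.T @ (p - a_preferred) / len(a_preferred)
        w -= lr * grad
    return w


def predict_preference(w, h_a, h_b):
    """Return the modelled probability that a is preferred over b."""
    return sigmoid((h_a - h_b) @ w)


# Synthetic stand-in for hidden vectors (assumption: real use would extract
# them from a chosen layer of an LLM).
rng = np.random.default_rng(0)
true_w = rng.normal(size=16)
h_a = rng.normal(size=(400, 16))
h_b = rng.normal(size=(400, 16))
labels = ((h_a - h_b) @ true_w > 0).astype(float)

w = fit_bt_probe(h_a, h_b, labels)
predicted = predict_preference(w, h_a, h_b) > 0.5
accuracy = np.mean(predicted == (labels == 1.0))
```

Because the score depends only on the difference of the two hidden vectors, swapping the pair flips the prediction: the two probabilities always sum to one. Under this reading, transferring a probe to a new task would amount to reusing the fitted w on hidden vectors from a different set of word pairs.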

arXiv information

Authors	Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
Published	2023-11-30 18:53:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
