Homogeneity Bias as Differential Sampling Uncertainty in Language Models



Prior research shows that Large Language Models (LLMs) and Vision-Language Models (VLMs) represent marginalized groups more homogeneously than dominant groups. However, the mechanisms underlying this homogeneity bias remain relatively unexplored. We propose that this bias emerges from systematic differences in the probability distributions from which tokens are sampled at inference time. Analyzing three measures of uncertainty in token sampling distributions (entropy, perplexity, and probability of differentiation), we find that in some models, specifically GPT-4 Turbo and Llama-3.2, tokens are sampled more deterministically when generating texts about marginalized groups (i.e., Black Americans and women) than when generating texts about their dominant group counterparts (i.e., White Americans and men). While these findings may help explain homogeneity bias in certain models, the patterns did not replicate across all VLMs tested, suggesting that multiple mechanisms may contribute to homogeneity bias in AI.
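The abstract names the three uncertainty measures but gives no formulas. As a minimal sketch, the Python below computes them for a single next-token probability distribution. It assumes commonly used definitions that the abstract does not confirm: Shannon entropy in nats, perplexity as the exponential of that entropy, and probability of differentiation as one minus the sum of squared probabilities. The function names and the example distributions are hypothetical, chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def perplexity(probs):
    """Assumed definition: exponential of the entropy of the distribution."""
    return math.exp(entropy(probs))


def probability_of_differentiation(probs):
    """Assumed definition: chance that two independent draws differ."""
    return 1.0 - sum(p * p for p in probs)


def uncertainty_measures(probs):
    """Normalize the weights to sum to 1, then compute all three measures."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    normalized = [p / total for p in probs]
    return {
        "entropy": entropy(normalized),
        "perplexity": perplexity(normalized),
        "probability_of_differentiation": probability_of_differentiation(normalized),
    }


if __name__ == "__main__":
    # Hypothetical distributions over four candidate tokens.
    flat = [0.25, 0.25, 0.25, 0.25]    # high uncertainty
    peaked = [0.97, 0.01, 0.01, 0.01]  # near-deterministic sampling
    print(uncertainty_measures(flat))
    print(uncertainty_measures(peaked))
```

Under these assumed definitions, a uniform distribution over four tokens has entropy ln 4, perplexity 4, and probability of differentiation 0.75, and all three values fall as the distribution becomes more peaked. That lower-uncertainty regime is what "sampled more deterministically" would correspond to.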


Authors: Messi H. J. Lee, Soyeon Jeon
Published: 2025-01-31 17:36:12+00:00
Categories: cs.CL, cs.CV