GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction

Summary

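Social biases in LLMs are usually measured with bias benchmark datasets.
Current benchmarks are limited in scope, grounding, quality, and the human effort they require.
Previous work has had success with a community-sourced, rather than crowd-sourced, approach to benchmark development.
However, that work still required considerable effort from annotators with relevant lived experience.
This paper examines whether an LLM (specifically GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey.
It also extends the previous work to a new community and set of biases: the Jewish community and antisemitism.
The analysis shows that GPT-3.5-Turbo performs poorly on this annotation task and that its output has unacceptable quality issues.
The authors therefore conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that using it actually negates many of the benefits of community-sourced bias benchmarks.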

Summary (original)

Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.

arXiv information

Authors	Virginia K. Felkner, Jennifer A. Thompson, Jonathan May
Published	2024-05-24 17:56:03+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
