One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Abstract

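Language is not monolithic.
Benchmarks, including those designed for multiple languages, are often used as proxies for evaluating the performance of large language models (LLMs), but they tend to overlook the nuances of within-language variation and therefore fail to model the experience of speakers of non-standard dialects.
Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects in canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning.
We introduce \textbf{ReDial} (\textbf{Re}asoning with \textbf{Dial}ect Queries), a benchmark containing more than 1.2K parallel query pairs in Standardized English and AAVE.
We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K.
With ReDial, we evaluate widely used LLMs, including the GPT, Claude, Llama, Mistral, and Phi model families.
Our findings reveal that \textbf{almost all of these widely used models show significant brittleness and unfairness to queries in AAVE}.
Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for relevant future research.
Code and data can be accessed at https://github.com/fangru-lin/redial_dialect_robustness_fairness.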

Abstract (original)

Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation, and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects in canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce \textbf{ReDial} (\textbf{Re}asoning with \textbf{Dial}ect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that \textbf{almost all of these widely used models show significant brittleness and unfairness to queries in AAVE}. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for relevant future research. Code and data can be accessed at https://github.com/fangru-lin/redial_dialect_robustness_fairness.
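As a rough illustration of what evaluating on parallel query pairs can look like, the sketch below compares accuracy on Standardized English and AAVE versions of the same items. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the `PairResult` record, the `summarize_pairs` function, and the reported quantities (accuracy gap, count of pairs solved only in Standardized English) are hypothetical names and choices introduced here, and the toy data is made up purely to exercise the function.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PairResult:
    """Outcome for one parallel query pair (hypothetical record layout)."""
    task: str            # e.g. "math" or "algorithm"; labels are illustrative
    se_correct: bool     # model answered the Standardized English query correctly
    aave_correct: bool   # model answered the AAVE rewrite correctly


def summarize_pairs(results: List[PairResult]) -> Dict[str, float]:
    """Compute per-dialect accuracy and the gap between them.

    Assumption: a simple accuracy difference is used as the gap; the paper
    may define fairness and robustness differently.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must contain at least one pair")
    se_acc = sum(r.se_correct for r in results) / n
    aave_acc = sum(r.aave_correct for r in results) / n
    # Pairs where only the dialect of the query changed the outcome.
    se_only = sum(r.se_correct and not r.aave_correct for r in results)
    return {
        "n": n,
        "se_accuracy": se_acc,
        "aave_accuracy": aave_acc,
        "gap": se_acc - aave_acc,
        "se_only_correct": se_only,
    }


# Made-up toy outcomes, not results from the paper.
toy_results = [
    PairResult("math", True, True),
    PairResult("math", True, False),
    PairResult("logic", False, False),
    PairResult("algorithm", True, True),
]
summary = summarize_pairs(toy_results)
print(summary)
```

Because each AAVE query is a rewrite of a specific Standardized English query, counting pairs whose outcome flips isolates the effect of dialect from the difficulty of the underlying task.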

arXiv information

Authors	Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael Wooldridge, Janet B. Pierrehumbert, Furu Wei
Published	2025-01-14 09:52:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
