MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

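Abstract

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts.
To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages, with approximately 11,829 questions per language.
Building on the challenging, reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance.
We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, and analyze their performance across linguistic and cultural boundaries.
Our experiments reveal consistent performance degradation from high-resource to lower-resource languages. The best models achieve over 70% accuracy on English but drop to around 40% for languages such as Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances.
MMLU-ProX is an ongoing project.
We are expanding the benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.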

Abstract (original)

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
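As a rough illustration of the kind of per-language comparison the abstract describes, the sketch below computes accuracy for each language and its gap relative to English. It is not the authors' evaluation code: the record layout, the function names `accuracy_by_language` and `gap_to_reference`, the language codes, and the toy data are all assumptions made for this example, and the numbers it produces have no relation to the reported results.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for (language, predicted, gold) tuples.

    The tuple layout is a hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, predicted, gold in records:
        total[lang] += 1
        correct[lang] += int(predicted == gold)
    return {lang: correct[lang] / total[lang] for lang in total}


def gap_to_reference(acc, reference="en"):
    """Return each language's accuracy drop relative to the reference language."""
    ref = acc[reference]
    return {lang: ref - a for lang, a in acc.items() if lang != reference}


# Toy multiple-choice answers, invented purely for illustration.
toy_records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "C", "D"), ("en", "A", "A"),
    ("sw", "A", "A"), ("sw", "B", "C"), ("sw", "D", "C"), ("sw", "A", "B"),
]

acc = accuracy_by_language(toy_records)
gaps = gap_to_reference(acc)
print(acc)
print(gaps)
```

Reporting the gap against a fixed reference language keeps the comparison readable when many languages are evaluated, although the choice of English as the reference here is only a convenience of the sketch.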

arXiv information

Authors	Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Yun Xing, Junjue Wang, Huitao Li, Xin Li, Kunyu Yu, Nan Liu, Qingyu Chen, Douglas Teodoro, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Published	2025-03-13 15:59:20+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
