Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination

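Abstract

This article presents UNED-ACCESS 2024, a bilingual dataset of 1003 multiple-choice questions from university entrance level exams in Spanish and English.
The questions were originally written in Spanish and manually translated into English, and they have never been publicly released.
A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting on both the UNED-ACCESS 2024 dataset and an equivalent subset of MMLU questions.
The results show that (i) reasoning questions are challenging for models, (ii) smaller models perform worse than larger models and degrade faster in Spanish than in English, and (iii) the performance gap between languages is negligible for the best models and grows to as much as 37% for smaller models.
Model rankings on UNED-ACCESS 2024 are almost identical in English and Spanish and correlate highly (0.98 Pearson) with rankings on MMLU, suggesting that a small dataset is sufficiently diverse and representative to measure performance by discipline.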

Abstract (original)

In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and translated manually into English, and have not ever been publicly released. A selection of current open-source and proprietary models are evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) reasoning questions are challenging for models, (ii) smaller models perform worse than larger models and degrade faster in Spanish than in English and (iii) the performance gap between languages is negligible for the best models and grows up to 37% for smaller models. Model ranking on UNED-ACCESS 2024 is almost identical in English and Spanish, and has also a high correlation (0.98 Pearson) with ranking on MMLU, suggesting that a small dataset is sufficiently diverse and representative to measure performance by discipline.
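The 0.98 figure quoted above is a Pearson correlation between model scores on two benchmarks. As a minimal sketch of how such a value could be computed, the snippet below implements the standard Pearson formula in plain Python. The function names, the score lists, and the `relative_gap` definition are all hypothetical illustrations: the abstract does not state how the 37% language gap is calculated, so the relative form used here is only one possible reading, and none of the numbers come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def relative_gap(acc_en, acc_es):
    """Hypothetical gap definition: accuracy drop in Spanish relative to English."""
    return (acc_en - acc_es) / acc_en


# Made-up accuracies for four imaginary models (NOT results from the paper).
benchmark_a = [0.90, 0.80, 0.60, 0.40]
benchmark_b = [0.85, 0.78, 0.55, 0.42]

print(f"Pearson r: {pearson(benchmark_a, benchmark_b):.3f}")
print(f"Relative gap: {relative_gap(0.80, 0.60):.1%}")
```

A value close to 1 means the two benchmarks order and space the models similarly, which is the sense in which the abstract calls the smaller dataset representative.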

arXiv information

Authors	Eva Sánchez Salido,Roser Morante,Julio Gonzalo,Guillermo Marco,Jorge Carrillo-de-Albornoz,Laura Plaza,Enrique Amigó,Andrés Fernández,Alejandro Benito-Santos,Adrián Ghajari Espinosa,Victor Fresno
Published	2025-01-14 16:41:28+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google