SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

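Summary

With large language models (LLMs) developing rapidly, it is important to have benchmarks that can evaluate LLM capabilities across different domains.
One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs.
Inspired by how university students are evaluated on such tasks, this paper proposes SciEx, a benchmark consisting of university computer science exam questions, to evaluate the ability of LLMs to solve scientific tasks.
SciEx is (1) multilingual, containing both English and German exams, (2) multimodal, containing questions that involve images, and (3) varied, containing many types of free-form questions at different difficulty levels, as is natural for university exams.
The authors evaluate the performance of various state-of-the-art LLMs on the new benchmark.
Because SciEx questions are free-form, evaluating LLM performance is not straightforward.
The authors therefore provide human expert grading of the LLM outputs on SciEx.
The free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves an exam grade of only 59.4% on average.
The paper also provides a detailed comparison between LLM performance and student performance on SciEx.
To enable future evaluation of new LLMs, the authors propose using an LLM as a judge to grade LLM answers on SciEx.
Their experiments show that, although LLMs do not solve the exams perfectly, they are decent graders, achieving a Pearson correlation of 0.948 with expert grading.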
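As an illustration of the agreement metric mentioned above, the following is a minimal sketch of how a Pearson correlation between expert grades and LLM-judge grades could be computed. The function name `pearson`, the variable names, and the grade values are all hypothetical; the numbers are made up for illustration and are not SciEx data. The summary does not state at which granularity (per question or per exam) the authors computed the correlation, so this sketch simply assumes one grade pair per item.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grades (percent of achievable points); NOT taken from the paper.
expert_grades = [55.0, 40.0, 72.5, 30.0, 61.0]
llm_judge_grades = [58.0, 37.5, 70.0, 35.0, 60.0]

r = pearson(expert_grades, llm_judge_grades)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the judge ranks and spaces the answers much like the experts do; it does not by itself show that the absolute grades match, since a judge that is consistently a few points too generous can still correlate highly.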

Summary (original)

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability on solving scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) varied, containing various types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.

arXiv information

Authors	Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues
Published	2024-07-12 10:17:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
