M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

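Summary

Large language models have recently made remarkable progress in many respects, such as cross-task generalization and instruction following.
Comprehensively evaluating the capabilities of large language models across multiple tasks is therefore very important.
This paper proposes M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, developed to measure the knowledge acquired by Chinese large language models by testing their multitask accuracy in zero-shot and few-shot settings.
We collected 20,477 questions from 71 tasks.
Our selection covers all major levels of the Chinese education system, from primary school to college, as well as a wide range of subjects, including humanities, history, politics, law, education, psychology, science, technology, art, and religion.
All questions are multiple-choice questions with four options, which guarantees a standardized and unified assessment process.
We evaluated a number of state-of-the-art open-source Chinese large language models on the proposed benchmark.
These models range in size from 335M to 130B parameters.
The experimental results show that these models perform significantly worse than GPT-3.5, which reaches an accuracy of about 48% on M3KE.
The dataset is available at https://github.com/tjunlp-lab/M3KE.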
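
The evaluation setup described above (four-option multiple-choice questions, scored by accuracy in zero-shot and few-shot settings) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Question` fields, the prompt layout, the `Answer:` cue, and the `predict` callable standing in for a model are hypothetical and are not taken from the M3KE paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical option labels; the summary only states that each question has four options.
OPTION_LABELS = ("A", "B", "C", "D")


@dataclass
class Question:
    """A four-option multiple-choice question (field names are assumptions)."""
    stem: str
    options: Sequence[str]  # expected to hold exactly four option texts
    answer: str             # gold label, one of "A", "B", "C", "D"


def format_question(q: Question, with_answer: bool) -> str:
    """Render one question; include the gold answer only for few-shot demonstrations."""
    lines = [q.stem]
    lines += [f"{label}. {text}" for label, text in zip(OPTION_LABELS, q.options)]
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(q: Question, shots: Sequence[Question] = ()) -> str:
    """Zero-shot when `shots` is empty; otherwise prepend solved demonstrations."""
    blocks = [format_question(s, with_answer=True) for s in shots]
    blocks.append(format_question(q, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(
    questions: Sequence[Question],
    predict: Callable[[str], str],
    shots: Sequence[Question] = (),
) -> float:
    """Fraction of questions where the first character of the reply matches the gold label."""
    if not questions:
        return 0.0
    correct = sum(
        predict(build_prompt(q, shots)).strip().upper()[:1] == q.answer
        for q in questions
    )
    return correct / len(questions)
```

Passing an empty `shots` sequence gives the zero-shot setting, and passing a handful of solved questions gives the few-shot setting. Because every question has the same four-option shape, a single scoring function covers all tasks, which is the standardization benefit the summary mentions.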

Summary (original)

Large language models have recently made tremendous progress in a variety of aspects, e.g., cross-task generalization, instruction following. Comprehensively evaluating the capability of large language models in multiple tasks is of great importance. In this paper, we propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is developed to measure knowledge acquired by Chinese large language models by testing their multitask accuracy in zero- and few-shot settings. We have collected 20,477 questions from 71 tasks. Our selection covers all major levels of Chinese education system, ranging from the primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, hence guaranteeing a standardized and unified assessment process. We’ve assessed a number of state-of-the-art open-source Chinese large language models on the proposed benchmark. The size of these models varies from 335M to 130B parameters. Experiment results demonstrate that they perform significantly worse than GPT-3.5 that reaches an accuracy of ~ 48% on M3KE. The dataset is available at https://github.com/tjunlp-lab/M3KE.

arXiv information

Authors	Chuang Liu, Renren Jin, Yuqi Ren, Linhao Yu, Tianyu Dong, Xiaohan Peng, Shuting Zhang, Jianxiang Peng, Peiyi Zhang, Qingqing Lyu, Xiaowen Su, Qun Liu, Deyi Xiong
Published	2023-05-17 14:56:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
