ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

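Abstract

The unprecedented performance of large language models (LLMs) calls for comprehensive and accurate evaluation.
We argue that benchmarks for evaluating LLMs need to be comprehensive and systematic.
To this end, we propose the ZhuJiu benchmark, which has the following strengths.
(1) Multi-dimensional ability coverage: we comprehensively evaluate LLMs across 7 ability dimensions covering 51 tasks.
In particular, we also propose a new benchmark that focuses on the knowledge ability of LLMs.
(2) Collaboration of multi-faceted evaluation methods: we use 3 different yet complementary evaluation methods to evaluate LLMs comprehensively, which helps ensure the authority and accuracy of the evaluation results.
(3) Comprehensive Chinese benchmark: ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation abilities in English.
(4) Avoiding potential data leakage: to avoid data leakage, we construct evaluation data specifically for 37 tasks.
We evaluate 10 current mainstream LLMs and conduct an in-depth discussion and analysis of their results.
The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/, and a demo video is available at https://youtu.be/qypkJ89L1Ic.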

Abstract (original)

The unprecedented performance of large language models (LLMs) requires comprehensive and accurate evaluation. We argue that for LLMs evaluation, benchmarks need to be comprehensive and systematic. To this end, we propose the ZhuJiu benchmark, which has the following strengths: (1) Multi-dimensional ability coverage: We comprehensively evaluate LLMs across 7 ability dimensions covering 51 tasks. Especially, we also propose a new benchmark that focuses on knowledge ability of LLMs. (2) Multi-faceted evaluation methods collaboration: We use 3 different yet complementary evaluation methods to comprehensively evaluate LLMs, which can ensure the authority and accuracy of the evaluation results. (3) Comprehensive Chinese benchmark: ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation abilities in English. (4) Avoiding potential data leakage: To avoid data leakage, we construct evaluation data specifically for 37 tasks. We evaluate 10 current mainstream LLMs and conduct an in-depth discussion and analysis of their results. The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/ and we also provide a demo video at https://youtu.be/qypkJ89L1Ic.

arXiv information

Authors	Baoli Zhang, Haining Xie, Pengfan Du, Junhao Chen, Pengfei Cao, Yubo Chen, Shengping Liu, Kang Liu, Jun Zhao
Published	2023-08-28 06:56:44+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
