An Improved Traditional Chinese Evaluation Suite for Foundation Model

Summary

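We introduce TMMLU+, a new benchmark designed for Traditional Chinese language understanding.
TMMLU+ is a multiple-choice question-answering dataset covering 66 subjects, from elementary to professional level.
It is six times larger than its predecessor, Taiwan Massive Multitask Language Understanding (TMMLU), and has a more balanced subject distribution.
We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs), with parameter counts ranging from 1.8B to 72B, on TMMLU+.
Our findings are as follows. (1) Traditional Chinese models still trail their Simplified Chinese counterparts, which highlights the need for more focused work on LLMs that cater to Traditional Chinese.
(2) Current LLMs still fall short of human performance in average score, suggesting that future research may need to look more closely at social science and humanities subjects.
(3) Of all the tokenization compression metrics examined, only the fertility score shows a strong correlation with the benchmark results.
We expect TMMLU+ to pinpoint areas for future model improvement, thereby narrowing the gap between machine and human linguistic capabilities and supporting researchers developing Traditional Chinese LLMs.
The dataset and the benchmark source code are available at huggingface.co/datasets/ikala/tmmluplus.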

Abstract (original)

We present TMMLU+, a new benchmark designed for Traditional Chinese language understanding. TMMLU+ is a multi-choice question-answering dataset with 66 subjects from elementary to professional level. It is six times larger and boasts a more balanced subject distribution than its predecessor, Taiwan Massive Multitask Language Understanding (TMMLU). We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs) of parameters ranging from 1.8B to 72B on the proposed TMMLU+. Our findings reveal that (1.) Traditional Chinese models still trail behind their Simplified Chinese counterparts, highlighting a need for more focused advancements in LLMs catering to Traditional Chinese. (2.) Current LLMs still fall short of human performance in average scores, indicating a potential need for future research to delve deeper into social science and humanities subjects. (3.) Among all the tokenization compression metrics examined, we identify that only the fertility score uniquely demonstrates strong correlations with our benchmark results. We foresee that TMMLU+ will pinpoint areas for future model improvement, thereby narrowing the gap between machine and human linguistic capabilities and supporting researchers in developing Traditional Chinese LLMs. Our dataset, along with the benchmark source code, is accessible at huggingface.co/datasets/ikala/tmmluplus.
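
Finding (3) refers to the fertility score of a tokenizer. As a rough illustration only, the sketch below computes fertility as the average number of tokens per input character. The per-character normalization, the toy tokenizers `char_tokenizer` and `byte_tokenizer`, and the sample strings are all assumptions made here for illustration; they are not the paper's definition, tokenizers, or data.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens produced per input character.

    Assumption: Chinese text has no whitespace word boundaries, so this
    sketch normalizes by character count rather than by word count.
    """
    n_tokens = 0
    n_chars = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_chars += len(text)
    if n_chars == 0:
        raise ValueError("texts must contain at least one character")
    return n_tokens / n_chars


# Two toy tokenizers standing in for real ones (hypothetical, for illustration).
def char_tokenizer(text: str) -> List[str]:
    # One token per character.
    return list(text)


def byte_tokenizer(text: str) -> List[str]:
    # One token per UTF-8 byte, rendered as a two-digit hex string.
    return [format(b, "02x") for b in text.encode("utf-8")]


# Illustrative Traditional Chinese strings (not taken from the dataset).
sample = ["繁體中文", "語言理解"]
char_fertility = fertility(char_tokenizer, sample)
byte_fertility = fertility(byte_tokenizer, sample)
```

A lower value means the tokenizer needs fewer tokens for the same text. Here the byte-level toy tokenizer has a higher fertility than the character-level one, because each of these characters takes several bytes in UTF-8.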

arXiv information

Authors	Zhi-Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Jun-Da Chen, Wei-Min Chu, Sega Cheng, Hong-Han Shuai
Published	2024-07-10 15:11:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
