Measuring Taiwanese Mandarin Language Understanding

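Summary

The evaluation of large language models (LLMs) has recently attracted considerable attention.
This work focuses on evaluating LLMs in a Chinese-language context, specifically Traditional Chinese, which is largely underrepresented in existing benchmarks.
It introduces TMLU, a holistic evaluation suite tailored to assess advanced knowledge and reasoning ability in LLMs in the context of Taiwanese Mandarin.
TMLU comprises 37 subjects spanning social science, STEM, humanities, Taiwan-specific content, and others, at levels ranging from middle school to professional.
In addition, chain-of-thought-style few-shot explanations are curated for each subject to facilitate the evaluation of complex reasoning skills.
To establish a comprehensive baseline, the authors run extensive experiments and analysis on 24 advanced LLMs.
The results suggest that Chinese open-weight models perform worse than multilingual proprietary models, and that open-weight models tailored for Taiwanese Mandarin lag behind their Simplified Chinese counterparts.
These findings indicate substantial room for improvement and underline TMLU's goal of fostering the development of localized Taiwanese Mandarin LLMs.
The benchmark and evaluation scripts are released to the community to promote future research.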
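
As a rough illustration of what few-shot evaluation with chain-of-thought-style explanations can look like, here is a minimal Python sketch. It is not the TMLU evaluation script: the multiple-choice layout, the `Example` class, the prompt wording, and the helper names (`build_prompt`, `extract_answer`, `accuracy`) are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Example:
    """Hypothetical few-shot item; the field names are assumptions."""
    question: str
    choices: Dict[str, str]   # choice label -> choice text
    explanation: str          # chain-of-thought-style rationale
    answer: str               # gold choice label, e.g. "B"


def format_question(question: str, choices: Dict[str, str]) -> str:
    """Render a question followed by its labelled choices."""
    lines = [f"Question: {question}"]
    lines += [f"({label}) {text}" for label, text in choices.items()]
    return "\n".join(lines)


def build_prompt(shots: List[Example], question: str,
                 choices: Dict[str, str]) -> str:
    """Concatenate worked examples, then the target question.

    The prompt ends with "Explanation:" so the model is nudged to
    reason before it states an answer.
    """
    blocks = [
        f"{format_question(s.question, s.choices)}\n"
        f"Explanation: {s.explanation}\n"
        f"Answer: ({s.answer})"
        for s in shots
    ]
    blocks.append(f"{format_question(question, choices)}\nExplanation:")
    return "\n\n".join(blocks)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer: (X)' label in a completion, if any."""
    matches = re.findall(r"Answer:\s*\(?([A-E])\b", completion)
    return matches[-1] if matches else None


def accuracy(completions: List[str], gold: List[str]) -> float:
    """Fraction of completions whose extracted label matches the gold label."""
    correct = sum(extract_answer(c) == g for c, g in zip(completions, gold))
    return correct / len(gold) if gold else 0.0
```

In use, `build_prompt` would be called once per test question with that subject's few-shot examples, the resulting string sent to the model under evaluation, and the returned completions scored with `accuracy`. Taking the last matching label is a design choice that tolerates explanations which mention other options before committing to a final answer.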

Abstract (original)

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capability in LLMs, under the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance compared to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind the Simplified-Chinese counterparts. The findings indicate great headroom for improvement, and emphasize the goal of TMLU to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

arXiv information

Authors	Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen
Published	2024-03-29 13:56:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
