CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

Abstract

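Advances in large language models (LLMs) have strengthened their ability to generalize, through instruction following, across a wide range of unseen natural language processing (NLP) tasks.
However, their effectiveness often drops in low-resource languages such as Chinese, a problem compounded by evaluations biased through data leakage, which casts doubt on how well they truly generalize to new linguistic territory.
In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language.
CIF-Bench consists of 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories.
To mitigate data contamination, we release only half of the dataset publicly and keep the remainder private, and we introduce diversified instructions to minimize score variance, bringing the total to 45,000 data instances.
Evaluating 28 selected LLMs reveals a noticeable performance gap: the best model scores only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts.
This work not only exposes the current limitations of LLMs on Chinese-language tasks but also sets a new standard for future research on LLM generalizability, pushing toward more adaptable, culturally informed, and linguistically diverse models.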

Abstract (original)

The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future LLM generalizability research, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models.

arXiv information

Authors	Yizhi LI, Ge Zhang, Xingwei Qu, Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang, Chenghua Lin, Jie Fu
Published	2024-06-04 14:26:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
