KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Summary

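The unprecedented performance of large language models (LLMs) calls for better evaluation.
Rather than merely exploring the breadth of LLM abilities, we believe that meticulous and thoughtful design is essential to thorough, unbiased, and applicable evaluation.
Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors. (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks.
(2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora, with the aim of evaluating the capacity to handle unseen data and evolving knowledge.
(3) For evaluation criteria, we adopt a contrastive system that includes overall standard scores, for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination.
We evaluate $21$ open-source and commercial LLMs and obtain some intriguing findings.
The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.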

Abstract (original)

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate $21$ open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.

arXiv information

Authors	Jifan Yu,Xiaozhi Wang,Shangqing Tu,Shulin Cao,Daniel Zhang-Li,Xin Lv,Hao Peng,Zijun Yao,Xiaohan Zhang,Hanming Li,Chunyang Li,Zheyuan Zhang,Yushi Bai,Yantao Liu,Amy Xin,Nianyi Lin,Kaifeng Yun,Linlu Gong,Jianhui Chen,Zhili Wu,Yunjia Qi,Weikai Li,Yong Guan,Kaisheng Zeng,Ji Qi,Hailong Jin,Jinxin Liu,Yu Gu,Yuan Yao,Ning Ding,Lei Hou,Zhiyuan Liu,Bin Xu,Jie Tang,Juanzi Li
Published	2023-06-15 17:20:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
