Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

Summary

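From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has seen steep performance gains and wide practical adoption.
How a research field is evaluated guides the direction of its improvement.
However, evaluating LLMs thoroughly is very difficult, for two reasons.
First, LLMs perform so well that traditional NLP tasks are no longer adequate for evaluating them.
Second, existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
To address these problems, existing work has proposed various benchmarks for evaluating LLMs more appropriately.
To bring order to the many evaluation tasks in both academia and industry, we survey multiple papers on LLM evaluation.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
For each competency, we present its definition, the corresponding benchmarks, and the metrics.
Under this competency architecture, similar tasks are grouped to reflect the corresponding ability, and new tasks can easily be added to the system.
Finally, we offer suggestions on future directions for LLM evaluation.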

Abstract (original)

From pre-trained language model (PLM) to large language model (LLM), the field of natural language processing (NLP) has witnessed steep performance gains and wide practical uses. The evaluation of a research field guides its direction of improvement. However, LLMs are extremely hard to thoroughly evaluate for two reasons. First of all, traditional NLP tasks become inadequate due to the excellent performance of LLM. Secondly, existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios. To tackle these problems, existing works proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we investigate multiple papers concerning LLM evaluations. We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety. For every competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system. Finally, we give our suggestions on the future direction of LLM’s evaluation.

arXiv information

Authors	Ziyu Zhuang, Qiguang Chen, Longxuan Ma, Mingda Li, Yi Han, Yushan Qian, Haopeng Bai, Zixian Feng, Weinan Zhang, Ting Liu
Published	2023-08-15 17:40:34+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
