Don’t Make Your LLM an Evaluation Benchmark Cheater

Summary

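Large language models (LLMs) have greatly advanced the frontier of artificial intelligence and achieved remarkable gains in model capacity. A typical way to assess model performance is to construct evaluation benchmarks that measure the ability of LLMs in different aspects. Although many high-quality benchmarks have been released, concerns about the appropriate use of these benchmarks and the fair comparison of different models are growing. With these concerns in mind, we discuss the potential risks and impact of using evaluation benchmarks inappropriately and of misinterpreting evaluation results. In particular, we focus on one issue that leads to inappropriate evaluation, benchmark leakage: data related to the evaluation sets is sometimes used for model training. This has become more common because pre-training data is often prepared before the model is tested. We conduct extensive experiments on the effect of using benchmark data in training and find that it can dramatically boost evaluation results, which ultimately leads to an unreliable assessment of model performance. To improve the use of existing evaluation benchmarks, we present several guidelines for both LLM developers and benchmark maintainers. We hope this work draws attention to the appropriate training and evaluation of LLMs.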

Summary (original)

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs in different aspects. Despite that a number of high-quality benchmarks have been released, the concerns about the appropriate use of these benchmarks and the fair comparison of different models are increasingly growing. Considering these concerns, in this paper, we discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results. Specially, we focus on a special issue that would lead to inappropriate evaluation, i.e., benchmark leakage, referring that the data related to evaluation sets is occasionally used for model training. This phenomenon now becomes more common since pre-training data is often prepared ahead of model test. We conduct extensive experiments to study the effect of benchmark leverage, and find that it can dramatically boost the evaluation results, which would finally lead to an unreliable assessment of model performance. To improve the use of existing evaluation benchmarks, we finally present several guidelines for both LLM developers and benchmark maintainers. We hope this work can draw attention to appropriate training and evaluation of LLMs.
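
To make the idea of benchmark leakage concrete, here is a minimal sketch of one common way to check whether a benchmark item overlaps with training text: word n-gram overlap. It is an illustration only, not the procedure used in the paper. The function names `word_ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n = 8` are all assumptions chosen for this example.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in any training document.

    Hypothetical helper: a high ratio suggests the item may have leaked
    into the training data; it is a heuristic signal, not proof.
    """
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)
```

A ratio near 1 for many test items would be a reason to treat scores on that benchmark with caution. Real pipelines would likely need a proper tokenizer and an index over the corpus rather than an in-memory set.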

arXiv information

Authors	Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
Published	2023-11-03 14:59:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
