Benchmarking LLMs via Uncertainty Quantification

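Abstract

The proliferation of open-source large language models (LLMs) from various institutions has highlighted an urgent need for comprehensive evaluation methods.
However, current evaluation platforms, such as the widely known HuggingFace open LLM leaderboard, neglect uncertainty, an aspect that is essential for evaluating LLMs thoroughly.
To fill this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our examination covers eight LLMs (LLM series) across five representative natural language processing tasks.
We also introduce an uncertainty-aware evaluation metric, UAcc, which accounts for both prediction accuracy and prediction uncertainty.
Our findings show the following. I) LLMs with higher accuracy may exhibit lower certainty.
II) Larger-scale LLMs may display greater uncertainty than their smaller counterparts.
III) Instruction fine-tuning tends to increase the uncertainty of LLMs.
By taking uncertainty into account, the new UAcc metric can amplify or diminish the relative improvement of one LLM over another, and may even change the relative ranking of two LLMs.
These results underscore the importance of incorporating uncertainty into the evaluation of LLMs.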
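
The abstract does not give the formula for UAcc, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes uncertainty is measured as the average size of a prediction set per multiple-choice question (for example, one produced by conformal prediction), and that the score divides accuracy by that average size and rescales by the square root of the number of answer options. The function name, the parameters, and the toy data are all hypothetical.

```python
import math


def uncertainty_aware_accuracy(correct, set_sizes, num_options):
    """Hypothetical UAcc-style score (an assumption, not taken from the abstract).

    correct     -- list of booleans, one per question (was the top answer right?)
    set_sizes   -- list of prediction-set sizes, one per question
    num_options -- number of answer options per question
    """
    if not correct or len(correct) != len(set_sizes):
        raise ValueError("correct and set_sizes must be non-empty and equal in length")
    accuracy = sum(correct) / len(correct)
    mean_set_size = sum(set_sizes) / len(set_sizes)
    # Larger prediction sets mean more uncertainty, so they shrink the score.
    return accuracy / mean_set_size * math.sqrt(num_options)


# Toy data: model A is more accurate but less certain than model B.
correct_a = [True] * 16 + [False] * 4   # accuracy 0.80
sizes_a = [3] * 20                      # mean prediction-set size 3
correct_b = [True] * 15 + [False] * 5   # accuracy 0.75
sizes_b = [2] * 20                      # mean prediction-set size 2

uacc_a = uncertainty_aware_accuracy(correct_a, sizes_a, num_options=6)
uacc_b = uncertainty_aware_accuracy(correct_b, sizes_b, num_options=6)
print(f"model A: {uacc_a:.3f}, model B: {uacc_b:.3f}")
```

With this toy data, model A wins on plain accuracy but model B wins on the uncertainty-aware score, which is the kind of ranking change the abstract describes.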

Abstract (original)

The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect — uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves eight LLMs (LLM series) spanning five representative natural language processing tasks. Additionally, we introduce an uncertainty-aware evaluation metric, UAcc, which takes into account both prediction accuracy and prediction uncertainty. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. By taking uncertainty into account, our new UAcc metric can either amplify or diminish the relative improvement of one LLM over another and may even change the relative ranking of two LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs.

arXiv information

Authors	Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu
Published	2024-01-23 14:29:17+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
