Efficient multi-prompt evaluation of LLMs

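Summary

The most widely used benchmarks for comparing LLMs rely on a small set of prompt templates, so they may not fully capture a model's abilities and can undermine the reproducibility of leaderboard results.
Many recent studies empirically confirm prompt sensitivity and call for changes to how LLMs are evaluated.
Rather than searching for a single prompt to evaluate with, this paper considers the problem of estimating the distribution of performance across many prompt variants.
It introduces PromptEval, a method for estimating performance across a large set of prompts that borrows strength across prompts and examples to produce accurate estimates under practical evaluation budgets.
The resulting distribution can be used to obtain performance quantiles, from which various robust performance metrics can be constructed (for example, the top 95% quantile or the median).
The authors prove that PromptEval consistently estimates the performance distribution, and they demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry.
For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations.
The code and data are available at https://github.com/felipemaiapolo/prompt-eval.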

Abstract (original)

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs’ abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry. For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Our code and data can be found at https://github.com/felipemaiapolo/prompt-eval.
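
The abstract describes summarizing a model by quantiles of its performance across many prompt templates. The sketch below only illustrates that final summarization step and the budget arithmetic from the MMLU example; it is not the PromptEval estimator itself. The function name `performance_quantiles`, the simulated accuracies, and the choice of linear interpolation between order statistics are all assumptions made for illustration.

```python
import random


def performance_quantiles(accuracies, probs=(0.05, 0.5, 0.95)):
    """Empirical quantiles of per-prompt accuracies.

    Uses linear interpolation between order statistics (an assumed
    convention; the paper may define quantiles differently).
    """
    ordered = sorted(accuracies)
    n = len(ordered)
    result = {}
    for p in probs:
        pos = p * (n - 1)          # fractional index into the sorted list
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        result[p] = ordered[lo] * (1 - frac) + ordered[hi] * frac
    return result


# Hypothetical data: accuracy of one model under 100 prompt templates.
# In practice these values would come from an estimator such as PromptEval,
# not from a random number generator.
rng = random.Random(0)
accs = [min(1.0, max(0.0, rng.gauss(0.65, 0.05))) for _ in range(100)]
q = performance_quantiles(accs)
print({p: round(v, 3) for p, v in q.items()})

# Budget arithmetic for the MMLU example in the abstract: a budget of two
# single-prompt evaluations spread over 100 templates covers 2/100 of the
# full (template, example) grid.
n_templates = 100
budget_in_single_prompt_evals = 2
fraction_evaluated = budget_in_single_prompt_evals / n_templates
print(f"fraction of grid evaluated: {fraction_evaluated:.0%}")
```

Reporting a low quantile alongside the median gives a sense of how a model behaves under unfavorable prompts, which a single-prompt score cannot show.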

arXiv information

Authors	Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
Published	2024-05-27 14:24:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
