Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

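Summary

Recent advances in language models (LMs) have spurred the creation of many benchmarks designed to assess these models' general capabilities.
An important task, however, is assessing the validity of the benchmarks themselves.
This is most commonly done through Benchmark Agreement Testing (BAT), in which a new benchmark is validated against established benchmarks using some agreement metric (for example, rank correlation).
Although BAT plays an important role for benchmark builders and consumers, there is no standardized procedure for such agreement testing.
This gap can lead to invalid conclusions, foster mistrust in benchmarks, and undermine the ability to choose the appropriate benchmark to use.
By analyzing more than 40 prominent benchmarks, the authors show how overlooked methodological choices can significantly influence BAT results and potentially undermine the validity of the conclusions.
To address these inconsistencies, they propose a set of best practices for BAT and demonstrate that following them greatly improves the robustness and validity of BAT.
To encourage adoption and support future research, they introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
The findings underscore the need for standardized BAT to ensure the robustness and validity of benchmark evaluations in the evolving landscape of language model research.
BenchBench package: github.com/IBM/BenchBench Leaderboard: hf.co/spaces/IBM/BenchBench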

Abstract (original)

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models’ general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and upending the ability to properly choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research. BenchBench Package: github.com/IBM/BenchBench Leaderboard: hf.co/spaces/IBM/BenchBench
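
To make the idea of an agreement metric concrete, here is a minimal sketch of a single agreement test using rank correlation (Kendall's tau) between two benchmarks. It is an illustration only and does not use the BenchBench package or its API; the function name, the model names, and all scores are hypothetical, and the simple tau-a variant used here applies no tie correction.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks, computed over their shared models.

    scores_a and scores_b map a model name to its score on each benchmark.
    Tied pairs count as neither concordant nor discordant (no tie correction).
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models to compare rankings")

    concordant = 0
    discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both benchmarks order the two models the same way.
        direction = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for illustration only.
new_benchmark = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62, "model_d": 0.55}
reference_benchmark = {"model_a": 0.90, "model_b": 0.70, "model_c": 0.75, "model_d": 0.40}

tau = kendall_tau(new_benchmark, reference_benchmark)
print(f"Kendall tau between the two benchmarks: {tau:.3f}")
```

In this toy data the two benchmarks disagree on only one of the six model pairs (model_b versus model_c). The resulting value depends on which models and which reference benchmark are included in the comparison, so a single number like this should be read with that in mind.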

arXiv information

Authors	Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
Published	2024-09-12 08:36:47+00:00
arXiv page	arxiv_id(pdf)

