Efficient Benchmarking (of Language Models)

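Summary

As language models (LMs) become more versatile, a new class of benchmarks has emerged that comprehensively evaluates a broad range of capabilities. Such benchmarks carry enormous computational costs, reaching thousands of GPU hours per model. Yet the efficiency of these evaluation efforts has received little discussion in the literature. In this work we present the problem of efficient benchmarking: intelligently reducing the computational cost of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the trade-off between computation and reliability. We propose evaluating the reliability of such decisions with a new measure, Decision Impact on Reliability (DIoR). We find, for example, that merely removing a low-ranked model from the benchmark can change the current leader on HELM, and we observe that a handful of examples is enough to obtain the correct benchmark ranking. Conversely, a slightly different choice of HELM scenarios changes the ranking widely. Based on these findings, we outline a set of concrete recommendations for more efficient benchmark design and use, which lead to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by a factor of 100 or more.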

Abstract (original)

The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, reaching thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability tradeoff. We propose to evaluate the reliability of such decisions by using a new measure: Decision Impact on Reliability, DIoR for short. We find, for example, that the current leader on HELM may change by merely removing a low-ranked model from the benchmark, and observe that a handful of examples suffice to obtain the correct benchmark ranking. Conversely, a slightly different choice of HELM scenarios varies ranking widely. Based on our findings, we outline a set of concrete recommendations for more efficient benchmark design and utilization practices, leading to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.
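
The abstract does not define DIoR, so the following is not the paper's measure. It is only a minimal sketch of the underlying question: how well does a ranking computed from a few examples agree with the ranking computed from all of them? It assumes per-example scores are available for every model, uses Kendall's tau as a stand-in agreement score, and runs on synthetic data. All function names (`kendall_tau`, `rank_models`, `subsample_agreement`, `make_synthetic_scores`) are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same models (best first)."""
    if len(rank_a) < 2:
        raise ValueError("need at least two models to compare rankings")
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def rank_models(scores, example_ids):
    """Rank models by mean score over the chosen examples, best first."""
    ids = list(example_ids)
    means = {m: sum(s[i] for i in ids) / len(ids) for m, s in scores.items()}
    # Ties are broken by model name so the ranking is deterministic.
    return sorted(means, key=lambda m: (-means[m], m))


def subsample_agreement(scores, n_examples, n_trials=200, seed=0):
    """Mean Kendall tau between the full ranking and subsampled rankings."""
    rng = random.Random(seed)
    n_total = len(next(iter(scores.values())))
    full_ranking = rank_models(scores, range(n_total))
    taus = []
    for _ in range(n_trials):
        ids = rng.sample(range(n_total), n_examples)
        taus.append(kendall_tau(full_ranking, rank_models(scores, ids)))
    return sum(taus) / len(taus)


def make_synthetic_scores(n_models=5, n_examples=1000, seed=0):
    """Synthetic 0/1 scores; model_k is correct with probability 0.4 + 0.1*k.

    These numbers are made up for illustration and are not HELM results.
    """
    rng = random.Random(seed)
    return {
        f"model_{k}": [1 if rng.random() < 0.4 + 0.1 * k else 0
                       for _ in range(n_examples)]
        for k in range(n_models)
    }


if __name__ == "__main__":
    scores = make_synthetic_scores()
    for n in (10, 50, 200, 1000):
        print(n, round(subsample_agreement(scores, n), 3))
```

Plotting the agreement against the number of sampled examples gives a rough picture of the computation-reliability tradeoff for one design decision. The same pattern could be applied to other decisions the abstract mentions, such as which scenarios or which models are included.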

arXiv information

Authors	Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen
Published	2023-08-31 18:18:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
