Statistical multi-metric evaluation and visualization of LLM system predictive performance

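Summary

Evaluating systems built on generative or discriminative large language models (LLMs) is often a complex, multi-dimensional problem.
Typically, a set of alternative system configurations is evaluated on one or more benchmark datasets, each with one or more evaluation metrics, and the metrics may differ between datasets.
We often want to assess, using a statistical measure of significance, whether systems perform differently on a given dataset according to a single metric, in aggregate across the metrics on a dataset, or across datasets.
Such evaluations can support decision-making, for example deciding whether a change to a particular system component (such as the choice of LLM or of hyperparameter values) significantly improves performance over the current system configuration.
More generally, they can determine whether a fixed set of system configurations (e.g., a leaderboard list) differ significantly in performance according to the metrics of interest.
We present a framework implementation that automatically performs the correct statistical tests, properly aggregates the statistical results across metrics and datasets, and can visualize the results.
The framework is demonstrated on the multilingual code generation benchmark CrossCodeEval for several state-of-the-art LLMs.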

Summary (original)

The evaluation of generative or discriminative large language model (LLM)-based systems is often a complex multi-dimensional problem. Typically, a set of system configuration alternatives are evaluated on one or more benchmark datasets, each with one or more evaluation metrics, which may differ between datasets. We often want to evaluate — with a statistical measure of significance — whether systems perform differently either on a given dataset according to a single metric, on aggregate across metrics on a dataset, or across datasets. Such evaluations can be done to support decision-making, such as deciding whether a particular system component change (e.g., choice of LLM or hyperparameter values) significantly improves performance over the current system configuration, or, more generally, whether a fixed set of system configurations (e.g., a leaderboard list) have significantly different performances according to metrics of interest. We present a framework implementation that automatically performs the correct statistical tests, properly aggregates the statistical results across metrics and datasets (a nontrivial task), and can visualize the results. The framework is demonstrated on the multi-lingual code generation benchmark CrossCodeEval, for several state-of-the-art LLMs.
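
As a rough illustration of the kind of analysis the abstract describes, the sketch below compares two system configurations on per-example scores for several metrics and then adjusts the resulting p-values for multiple comparisons. It is not the framework's implementation: the choice of a paired sign-flip permutation test, the use of Holm's step-down adjustment as the way of combining results across metrics, the function names, and the metric names and scores in the usage section are all assumptions made for this example.

```python
import random
from statistics import mean


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on per-example scores.

    Under the null hypothesis that the two systems are exchangeable on each
    example, the sign of every per-example difference is equally likely to
    be positive or negative.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per example per system")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # The +1 terms count the observed statistic itself, so p is never 0.
    return (hits + 1) / (n_resamples + 1)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-example scores for two configurations on one dataset.
    rng = random.Random(1)
    metrics = {}
    for name, shift in [("exact_match", 0.10), ("edit_similarity", 0.0)]:
        base = [rng.random() for _ in range(50)]
        metrics[name] = (base, [min(1.0, x + shift) for x in base])

    raw = [paired_permutation_pvalue(a, b) for a, b in metrics.values()]
    for name, p_raw, p_adj in zip(metrics, raw, holm_adjust(raw)):
        print(f"{name}: raw p = {p_raw:.4f}, Holm-adjusted p = {p_adj:.4f}")
```

A paired test is used here because both configurations are scored on the same benchmark examples. The adjustment step matters once several metrics or datasets are tested together, since unadjusted p-values would overstate how many differences are significant.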

arXiv information

Authors	Samuel Ackerman,Eitan Farchi,Orna Raz,Assaf Toledo
Published	2025-01-30 10:21:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
