Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

Abstract (original English)

Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this challenge, we also introduce an efficient evaluation framework: Sort & Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples, enabling cost-effective lifelong benchmarking. Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (~1000x reduction) on a single A100 GPU, with low approximation error. As such, lifelong benchmarks offer a robust, practical solution to the ‘benchmark exhaustion’ problem.
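The abstract describes Sort & Search only at a high level. One simplified reading, assumed here purely for illustration, is: sort the test samples from easiest to hardest using the results of models that were already evaluated, then run a new model on a small, evenly spaced subset of that ranking and pick the cut-off k that best fits a "correct on the k easiest samples, wrong on the rest" pattern. The helpers `sort_samples`, `search_threshold` and `estimate_accuracy` are hypothetical, and the ranking below uses a plain count of correct models rather than the dynamic-programming ranking the paper refers to, so treat this as a sketch of the idea, not the authors' implementation.

```python
from typing import Callable, List, Sequence


def sort_samples(history: Sequence[Sequence[int]]) -> List[int]:
    """Rank sample indices from easiest to hardest (hypothetical helper).

    history[m][s] is 1 if previously evaluated model m got sample s right,
    else 0. Here a sample's "easiness" is simply how many models got it right
    (an assumed simplification of the paper's DP-based ranking).
    """
    num_samples = len(history[0])
    correct_counts = [sum(row[s] for row in history) for s in range(num_samples)]
    # sorted() is stable, so ties keep their original index order.
    return sorted(range(num_samples), key=lambda s: -correct_counts[s])


def search_threshold(observed: Sequence[int]) -> int:
    """Return the k that best fits `observed` to the step vector 1^k 0^(n-k).

    `observed` holds a model's 0/1 results listed easiest-first. The mismatch
    for a given k is (zeros among the first k) + (ones among the rest); one
    pass with running counts scores every k in O(n).
    """
    zeros_before = 0
    ones_after = sum(observed)
    best_k, best_err = 0, ones_after  # k = 0 predicts "wrong" everywhere
    for k, result in enumerate(observed, start=1):
        if result:
            ones_after -= 1
        else:
            zeros_before += 1
        err = zeros_before + ones_after
        if err < best_err:  # strict comparison: ties keep the smaller k
            best_k, best_err = k, err
    return best_k


def estimate_accuracy(
    ranking: Sequence[int], evaluate: Callable[[int], int], budget: int
) -> float:
    """Estimate a new model's accuracy from only `budget` evaluations.

    evaluate(sample_index) is a hypothetical callback that runs the new model
    on one sample and returns 1 if correct, else 0. Samples are taken at
    evenly spaced positions in the easiest-first ranking.
    """
    n = len(ranking)
    budget = max(1, min(budget, n))
    positions = [i * n // budget for i in range(budget)]
    observed = [evaluate(ranking[p]) for p in positions]
    return search_threshold(observed) / budget


# Usage: three previously evaluated models, four test samples.
history = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
ranking = sort_samples(history)  # [0, 1, 2, 3]
new_model_correct = {0, 1}
print(estimate_accuracy(ranking, lambda s: int(s in new_model_correct), budget=2))
# prints 0.5
```

Under this reading, the new model is queried `budget` times instead of once per test sample, which is where savings of the kind the abstract reports would come from; the quality of the estimate then depends on how well the sample ranking captures difficulty.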

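The compute figures in the abstract can be checked with a line of arithmetic: 180 GPU days is 4,320 GPU hours, so going down to 5 GPU hours is an 864x reduction, i.e. just under three orders of magnitude, which the abstract rounds to 1000x. The variable names below are illustrative only.

```python
# Figures quoted in the abstract (single A100 GPU).
baseline_gpu_days = 180
sort_and_search_gpu_hours = 5

baseline_gpu_hours = baseline_gpu_days * 24  # 4320 GPU hours
speedup = baseline_gpu_hours / sort_and_search_gpu_hours  # 864.0
print(f"{baseline_gpu_hours} GPU hours -> {sort_and_search_gpu_hours} GPU hours: {speedup:.0f}x")
# prints "4320 GPU hours -> 5 GPU hours: 864x"
```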
arXiv information

Authors	Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie
Published	2024-02-29 18:58:26+00:00
arXiv page	arxiv_id(pdf)

Provider, services used

arxiv.jp, Google
