Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

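Summary

We examine multi-task benchmarks in machine learning through the lens of social choice theory: models are the candidates and tasks are the voters. This view suggests distinguishing cardinal benchmark systems from ordinal ones. Cardinal systems aggregate numerical scores into a single model ranking, while ordinal systems aggregate the per-task rankings. Applying Arrow's impossibility theorem to ordinal benchmarks exposes inherent limitations of ordinal systems, in particular their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we show empirically that existing multi-task benchmarks exhibit a strong trade-off between diversity and sensitivity to irrelevant changes. The result rests on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies how much irrelevant changes to tasks affect a benchmark. Diversity captures the degree to which model rankings disagree across tasks. Because exact computation is computationally challenging, we develop efficient approximation algorithms for both measures. Extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks show a clear trade-off between diversity and stability: the more diverse a multi-task benchmark is, the more sensitive it is to trivial changes. We also show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. Code and data are available at https://socialfoundations.github.io/benchbench/.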

Abstract (original)

We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow’s impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow’s theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The codes and data are available at https://socialfoundations.github.io/benchbench/.
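The cardinal/ordinal distinction and the sensitivity to irrelevant models can be illustrated with a small sketch. It is not the authors' code or their sensitivity measure: the score table, the model and task names, the use of the mean as the cardinal rule, and the use of the Borda count as the ordinal rule are all assumptions chosen for illustration.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]  # scores[task][model], higher is better

# Hypothetical data: three tasks favour model_b, two tasks favour model_a.
SCORES: Scores = {
    "task_1": {"model_a": 0.7, "model_b": 0.8, "model_c": 0.1},
    "task_2": {"model_a": 0.7, "model_b": 0.8, "model_c": 0.1},
    "task_3": {"model_a": 0.7, "model_b": 0.8, "model_c": 0.1},
    "task_4": {"model_a": 0.9, "model_b": 0.2, "model_c": 0.5},
    "task_5": {"model_a": 0.9, "model_b": 0.2, "model_c": 0.5},
}


def cardinal_ranking(scores: Scores) -> List[str]:
    """Cardinal aggregation (assumed rule): rank models by mean score."""
    models = list(next(iter(scores.values())))
    mean = {m: sum(t[m] for t in scores.values()) / len(scores) for m in models}
    return sorted(models, key=lambda m: (-mean[m], m))


def borda_ranking(scores: Scores) -> List[str]:
    """Ordinal aggregation (assumed rule): Borda count over per-task rankings."""
    models = list(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for task_scores in scores.values():
        # Rank models within this task; only the order is used, not the scores.
        ordered = sorted(models, key=lambda m: (-task_scores[m], m))
        for position, m in enumerate(ordered):
            points[m] += len(models) - 1 - position
    return sorted(models, key=lambda m: (-points[m], m))


def drop_model(scores: Scores, model: str) -> Scores:
    """Return a copy of the score table without the given model."""
    return {
        task: {m: s for m, s in task_scores.items() if m != model}
        for task, task_scores in scores.items()
    }


if __name__ == "__main__":
    without_c = drop_model(SCORES, "model_c")
    print("cardinal, all models:   ", cardinal_ranking(SCORES))
    print("cardinal, model_c gone: ", cardinal_ranking(without_c))
    print("ordinal, all models:    ", borda_ranking(SCORES))
    print("ordinal, model_c gone:  ", borda_ranking(without_c))
```

In this toy table, removing model_c leaves the mean-score order of model_a and model_b unchanged, but flips their order under the Borda count. That is the kind of dependence on an irrelevant model that Arrow's theorem says ordinal aggregation cannot avoid in general.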

arXiv information

Authors	Guanhua Zhang, Moritz Hardt
Published	2024-05-06 15:09:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
