Exploring Dataset-Scale Indicators of Data Quality

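Summary

Modern computer vision foundation models are trained on massive amounts of data, which incurs large economic and environmental costs.
Recent work suggests that improving data quality can substantially reduce the amount of data required.
But what is data quality in computer vision?
We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been studied more extensively than the latter.
We ablate the effects of two important dataset-level constituents: label set design and class balance.
By monitoring these constituents with the key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of accuracy and robustness to distribution shifts.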

Summary (original)

Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.
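As a rough illustration of what monitoring a dataset-level constituent such as class balance could look like, the sketch below computes two generic statistics over a list of labels: normalized Shannon entropy and the ratio of the largest to the smallest class. These are common textbook measures chosen here as an assumption; the abstract does not say which indicators the authors provide, and the function names `class_balance` and `imbalance_ratio` are hypothetical.

```python
import math
from collections import Counter


def class_balance(labels):
    """Normalized Shannon entropy of the label distribution, in [0, 1].

    1.0 means every observed class has the same number of samples.
    This is a generic measure, not necessarily the paper's indicator.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must be non-empty")
    num_classes = len(counts)
    if num_classes == 1:
        # A single class is trivially balanced (a convention chosen here).
        return 1.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_classes)


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must be non-empty")
    return max(counts.values()) / min(counts.values())
```

For example, `imbalance_ratio(["cat"] * 9 + ["dog"])` returns `9.0`, and `class_balance` on the same list falls well below 1, flagging a skewed label distribution before any model is trained.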

arXiv information

Authors	Benjamin Feuer, Chinmay Hegde
Published	2023-11-07 14:14:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
