Scale Efficient Training for Large Datasets

Summary

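The rapid growth of dataset scale has been a key driver of progress in deep learning research.
As datasets grow, however, training becomes increasingly inefficient because of low-value samples: excessively redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement.
To address this, the authors propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time.
To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples by learning difficulty as measured by loss.
Building on this clustering, a sliding window strategy progressively removes both the overly challenging and the inefficient easy clusters, following an easy-to-hard curriculum.
The authors run extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing more than 3 million samples.
SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at a 70% cost reduction.
Experiments on real datasets of various scales, across several backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation), further demonstrate the effectiveness and generality of the approach.
Code is available at https://github.com/mrazhou/SeTa.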

Abstract (original)

The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
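
The three steps named in the abstract (random pruning, grouping by loss, and a sliding window that moves from easy to hard) can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_samples` and the parameters `keep_ratio`, `num_groups`, and `window_frac` are hypothetical, equal-size bins over the loss-sorted samples stand in for whatever clustering SeTa actually uses, and the linear window schedule is an assumption made for illustration.

```python
import random


def select_samples(losses, epoch, total_epochs,
                   keep_ratio=0.9, num_groups=10, window_frac=0.5, seed=0):
    """Return the indices to train on in one epoch (illustrative sketch only)."""
    rng = random.Random(seed + epoch)
    n = len(losses)

    # Step 1: random pruning, dropping a share of (presumably redundant) samples.
    kept = rng.sample(range(n), max(1, int(n * keep_ratio)))

    # Step 2: group the remaining samples by difficulty, using loss as the proxy.
    # Assumption: equal-size bins over the loss-sorted order replace clustering.
    kept.sort(key=lambda i: losses[i])
    k = min(num_groups, len(kept))
    bounds = [round(g * len(kept) / k) for g in range(k + 1)]
    groups = [kept[bounds[g]:bounds[g + 1]] for g in range(k)]

    # Step 3: slide a fixed-width window over the groups from easy to hard.
    # Groups outside the window (too easy or too hard for now) are skipped.
    width = max(1, round(k * window_frac))
    progress = epoch / max(1, total_epochs - 1)
    start = round(progress * (k - width))
    return [i for group in groups[start:start + width] for i in group]


# Toy usage: per-sample losses would normally come from the previous epoch.
toy_losses = [i / 100 for i in range(100)]
early = select_samples(toy_losses, epoch=0, total_epochs=5)
late = select_samples(toy_losses, epoch=4, total_epochs=5)
```

With these toy losses, the early window covers the low-loss half of the kept samples and the late window covers the high-loss half, so each epoch trains on only a fraction of the data. In practice the losses would be refreshed as training proceeds, which is what makes the pruning dynamic.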

arXiv information

Authors	Qing Zhou, Junyu Gao, Qi Wang
Published	2025-03-17 17:13:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
