Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation

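Summary

Data-efficient learning has attracted considerable attention, particularly given the current trend toward large multimodal models.
Recently, dataset distillation has emerged as an effective approach that synthesizes the data samples essential for network training.
However, which samples are essential to the dataset distillation process itself has not yet been investigated.
This work studies data efficiency and data selection for the dataset distillation task.
By reformulating the dynamics of distillation, we gain insight, both theoretical and empirical, into the redundancy inherent in real datasets.
We propose using the empirical loss value as a static data pruning criterion.
To further compensate for the variation of data value during training, we identify the most contributing samples based on their causal effect on distillation.
The proposed selection strategy exploits the training dataset efficiently, outperforms previous SOTA distillation algorithms, and consistently enhances distillation algorithms, even on much larger and more heterogeneous datasets such as full ImageNet-1K and Kinetics-400.
We believe this paradigm opens new avenues in the dynamics of distillation and paves the way for efficient dataset distillation.
Our code is available at https://github.com/silicx/GoldFromOres-BiLP.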

Abstract (original)

Data-efficient learning has garnered significant attention, especially given the current trend of large multi-modal models. Recently, dataset distillation has become an effective approach by synthesizing data samples that are essential for network training. However, it remains to be explored which samples are essential for the dataset distillation process itself. In this work, we study the data efficiency and selection for the dataset distillation task. By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset, both theoretically and empirically. We propose to use the empirical loss value as a static data pruning criterion. To further compensate for the variation of the data value in training, we find the most contributing samples based on their causal effects on the distillation. The proposed selection strategy can efficiently exploit the training dataset, outperform the previous SOTA distillation algorithms, and consistently enhance the distillation algorithms, even on much larger-scale and more heterogeneous datasets, e.g., full ImageNet-1K and Kinetics-400. We believe this paradigm will open up new avenues in the dynamics of distillation and pave the way for efficient dataset distillation. Our code is available on https://github.com/silicx/GoldFromOres-BiLP.
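The static criterion mentioned in the abstract, pruning real samples by their empirical loss, can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). The function name `prune_by_loss`, the `keep_ratio` parameter, and the `keep_lowest` switch are hypothetical, and the abstract does not say whether low-loss or high-loss samples should be retained, so the direction is left configurable.

```python
from typing import List, Sequence


def prune_by_loss(
    losses: Sequence[float],
    keep_ratio: float,
    keep_lowest: bool = True,  # assumption: the abstract does not fix the direction
) -> List[int]:
    """Return sorted indices of the samples kept after loss-based pruning.

    `losses` holds one precomputed empirical loss value per real sample.
    Illustrative sketch only; names and defaults are hypothetical.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Keep at least one sample so the pruned set is never empty.
    n_keep = max(1, int(round(len(losses) * keep_ratio)))
    # Rank sample indices by loss, ascending or descending as requested.
    order = sorted(
        range(len(losses)),
        key=lambda i: losses[i],
        reverse=not keep_lowest,
    )
    return sorted(order[:n_keep])


# Toy loss values, invented purely for illustration.
toy_losses = [0.9, 0.1, 0.5, 0.3]
kept_low = prune_by_loss(toy_losses, keep_ratio=0.5)
kept_high = prune_by_loss(toy_losses, keep_ratio=0.5, keep_lowest=False)
```

The criterion is static in the sense that the ranking is computed once from fixed loss values. The abstract's second, causal-effect-based selection accounts for how sample value varies during training and is not covered by this sketch.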

arXiv information

Authors	Yue Xu,Yong-Lu Li,Kaitong Cui,Ziyu Wang,Cewu Lu,Yu-Wing Tai,Chi-Keung Tang
Published	2024-08-07 12:59:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
