Effective Vision Transformer Training: A Data-Centric Perspective

Summary

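Vision Transformers (ViTs) have shown promising performance compared with convolutional neural networks (CNNs), but training ViTs is much harder than training CNNs.
In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the training process, and divide it accordingly into three periods: formation, growth, and exploration.
In particular, we observe that in the last stage of training only a tiny portion of the training examples is used to optimize the model.
Given the data-hungry nature of ViTs, we ask a simple but important question: is it possible to provide abundant "effective" training examples at every stage of training?
Answering it requires addressing two critical questions: how to measure the "effectiveness" of individual training examples, and how to systematically generate a sufficient number of "effective" examples when they are running out.
For the first question, we find that the "difficulty" of a training sample can be adopted as an indicator of its "effectiveness".
For the second question, we propose to dynamically adjust the "difficulty" distribution of the training data across these evolution stages.
To achieve both goals, we propose a novel data-centric ViT training framework that dynamically measures the "difficulty" of training samples and generates "effective" samples for the model at different training stages.
In addition, to further enlarge the number of "effective" samples and to alleviate overfitting in the late training stage of ViTs, we propose a patch-level erasing strategy dubbed PatchErasing.
Extensive experiments demonstrate the effectiveness of the proposed data-centric ViT training framework and techniques.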
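
The abstract does not spell out how PatchErasing selects or fills patches, so the following is only a minimal sketch of what a generic patch-level erasing augmentation could look like. The function name `erase_patches`, the parameters `patch_size`, `erase_ratio`, and `fill`, and the choices of uniform random patch selection and constant fill are all assumptions made for illustration, not the authors' method.

```python
import random


def erase_patches(image, patch_size, erase_ratio, fill=0.0, rng=None):
    """Erase a random subset of non-overlapping square patches.

    `image` is a 2D list of numbers (rows of pixels). Returns a new image
    and the sorted list of erased (row, col) patch indices. Hypothetical
    helper: the selection rule and fill value are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    # Enumerate the patch grid, as a ViT tokenizer would split the image.
    rows, cols = height // patch_size, width // patch_size
    grid = [(r, c) for r in range(rows) for c in range(cols)]

    # Pick the patches to erase uniformly at random (assumption).
    num_erased = int(round(erase_ratio * len(grid)))
    erased = rng.sample(grid, num_erased)

    # Copy the image so the input is left untouched, then overwrite patches.
    out = [list(row) for row in image]
    for r, c in erased:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                out[y][x] = fill
    return out, sorted(erased)
```

For example, calling `erase_patches(img, 2, 0.25)` on an 8x8 image splits it into 16 patches of 2x2 pixels and overwrites 4 of them. Erasing whole patches rather than arbitrary rectangles keeps the augmentation aligned with the token grid a ViT operates on, which is the motivation suggested by the term "patch-level".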

Summary (original)

Vision Transformers (ViTs) have shown promising performance compared with Convolutional Neural Networks (CNNs), but the training of ViTs is much harder than CNNs. In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the training process, and divide it into three periods accordingly: formation, growth and exploration. In particular, at the last stage of training, we observe that only a tiny portion of training examples is used to optimize the model. Given the data-hungry nature of ViTs, we thus ask a simple but important question: is it possible to provide abundant “effective” training examples at EVERY stage of training? To address this issue, we need to address two critical questions, i.e., how to measure the “effectiveness” of individual training examples, and how to systematically generate enough number of “effective” examples when they are running out. To answer the first question, we find that the “difficulty” of training samples can be adopted as an indicator to measure the “effectiveness” of training samples. To cope with the second question, we propose to dynamically adjust the “difficulty” distribution of the training data in these evolution stages. To achieve these two purposes, we propose a novel data-centric ViT training framework to dynamically measure the “difficulty” of training samples and generate “effective” samples for models at different training stages. Furthermore, to further enlarge the number of “effective” samples and alleviate the overfitting problem in the late training stage of ViTs, we propose a patch-level erasing strategy dubbed PatchErasing. Extensive experiments demonstrate the effectiveness of the proposed data-centric ViT training framework and techniques.

arXiv information

Authors	Benjia Zhou,Pichao Wang,Jun Wan,Yanyan Liang,Fan Wang
Published	2022-09-29 17:59:46+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
