Self-Distillation for Further Pre-training of Transformers

Summary

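Pre-training a large transformer model on massive unlabeled data and then fine-tuning it on labeled datasets for downstream tasks has proven to be a successful strategy across a variety of vision and natural language processing tasks.
However, directly fine-tuning the pre-trained model can be suboptimal when there is a large discrepancy between the pre-training and fine-tuning data domains.
To address this, several previous studies have proposed further pre-training strategies, in which the model continues to be pre-trained on the target unlabeled dataset before fine-tuning.
However, all of these studies focus solely on language models, and the authors find empirically that a Vision Transformer is vulnerable to overfitting when pre-training is continued on target unlabeled data.
To address this limitation, they propose self-distillation as a regularizer for the further pre-training stage.
Specifically, the initial pre-trained model is first further pre-trained on the target unlabeled data, and the result is treated as the teacher for self-distillation.
The same initial pre-trained model is then taken as the student, which is optimized with a masked auto-encoding objective while its hidden representations are forced to stay close to those of the teacher.
The efficacy of self-distillation is validated empirically on a variety of benchmark datasets for image and text classification tasks.
Experimentally, the proposed method outperforms all the relevant baselines.
Theoretically, the proposed method is analyzed with a simplified model to understand how self-distillation for further pre-training can help improve downstream task performance.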
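
The student objective described above (a masked auto-encoding loss plus a term pulling the student's hidden representations toward the teacher's) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the squared-error distance, the `distill_weight` parameter, and all function names are assumptions introduced here, and plain Python lists stand in for model outputs.

```python
from typing import Sequence

Vector = Sequence[float]


def mean_squared_error(a: Vector, b: Vector) -> float:
    """Mean of squared element-wise differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def masked_reconstruction_loss(
    predicted: Vector, target: Vector, mask: Sequence[bool]
) -> float:
    """Average squared error over masked positions only.

    mask[i] is True where the input element was hidden from the model.
    """
    masked = [(p, t) for p, t, m in zip(predicted, target, mask) if m]
    if not masked:
        return 0.0
    return sum((p - t) ** 2 for p, t in masked) / len(masked)


def self_distillation_loss(
    student_hidden: Vector,
    teacher_hidden: Vector,
    predicted: Vector,
    target: Vector,
    mask: Sequence[bool],
    distill_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Masked auto-encoding loss plus a penalty on student-teacher distance.

    The teacher is assumed to be the initial model after further pre-training;
    the student is assumed to start from the same initial model.
    """
    reconstruction = masked_reconstruction_loss(predicted, target, mask)
    distillation = mean_squared_error(student_hidden, teacher_hidden)
    return reconstruction + distill_weight * distillation
```

In this sketch, a student whose hidden representation already matches the teacher's pays only the reconstruction cost, so the distillation term acts purely as a regularizer on how far the student drifts during further pre-training.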

Abstract (original)

Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, direct fine-tuning of the pre-trained model may be suboptimal if there exist large discrepancies across data domains for pre-training and fine-tuning. To tackle this issue, several previous studies have proposed further pre-training strategies, where we continue to pre-train the model on the target unlabeled dataset before fine-tuning. However, all of them solely focus on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting as we continue to pre-train the model on target unlabeled data. In order to tackle this limitation, we propose self-distillation as a regularization for a further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then consider it as a teacher for self-distillation. Then we take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks. Experimentally, we show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially help improve the performance of the downstream tasks.

arXiv information

Authors	Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi
Published	2023-06-09 08:57:07+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
