From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Summary

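Knowledge distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, whereas self-KD obtains its soft labels without a real teacher.
This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both the target class (the image's category) and the non-target classes, a method named Universal Self-Knowledge Distillation (USKD).
Decomposing the KD loss shows that its non-target term forces the student's non-target logits to match the teacher's, but because the two sets of non-target logits sum to different values, they cannot become identical.
NKD normalizes the non-target logits so that their sums are equal.
It applies to both KD and self-KD, making better use of the soft labels in the distillation loss.
USKD generates customized soft labels for both the target and non-target classes without a teacher.
It smooths the student's target logit to form the soft target label and uses the rank of the intermediate feature to generate soft non-target labels following Zipf's law.
For KD with a teacher, NKD achieves state-of-the-art performance on CIFAR-100 and ImageNet, raising the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher.
For self-KD without a teacher, USKD is the first self-KD method that can be applied effectively to both CNN and ViT models with negligible additional time and memory cost, setting new state-of-the-art results such as accuracy gains of 1.17% and 0.55% on ImageNet for MobileNet and DeiT-Tiny, respectively.
The code is available at https://github.com/yzd-v/cls_KD.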
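
The normalization idea can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it assumes a plain softmax, omits any temperature or loss weighting the paper may use, and the function names `softmax`, `normalize_non_target`, and `nkd_style_loss` are hypothetical. It only shows that the raw non-target probabilities of teacher and student sum to different values, and that dividing each set by its own sum makes the two distributions comparable.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_non_target(probs, target):
    """Drop the target class and rescale the remaining probabilities to sum to 1."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]


def nkd_style_loss(student_logits, teacher_logits, target):
    """Return (target term, normalized non-target term) of a soft-label cross-entropy.

    Assumption: no temperature and no weighting factor; this is an illustrative
    simplification, not the loss as defined in the paper.
    """
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    # Target term: the teacher's target probability acts as the soft label.
    target_loss = -t[target] * math.log(s[target])
    # Non-target term: compare the two distributions after normalization.
    s_hat = normalize_non_target(s, target)
    t_hat = normalize_non_target(t, target)
    non_target_loss = -sum(tp * math.log(sp) for tp, sp in zip(t_hat, s_hat))
    return target_loss, non_target_loss
```

Calling `nkd_style_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], target=0)` returns the two terms separately, so a caller can weight them as needed.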

Abstract (original)

Knowledge Distillation (KD) uses the teacher’s prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to require the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both target class (image’s category) and non-target classes named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find the non-target loss from it forces the student’s non-target logits to match the teacher’s, but the sum of the two non-target logits is different, preventing them from being identical. NKD normalizes the non-target logits to equalize their sum. It can be generally used for KD and self-KD to better use the soft labels for distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf’s law. For KD with teachers, our NKD achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at https://github.com/yzd-v/cls_KD.
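
As a rough illustration of generating soft non-target labels with Zipf's law, the sketch below assigns each non-target class a weight proportional to 1/rank and normalizes the weights to sum to 1. The abstract does not specify how the rank is derived from the intermediate feature, so the `scores` argument here is a hypothetical stand-in for whatever ranking signal is used, and `zipf_non_target_labels` is an illustrative name, not a function from the authors' code.

```python
def zipf_non_target_labels(scores, target):
    """Build soft labels over non-target classes that follow Zipf's law.

    scores: one ranking score per class (assumed input; higher means higher rank).
    target: index of the ground-truth class, which receives label 0 here.
    """
    non_target = [i for i in range(len(scores)) if i != target]
    # Rank 1 goes to the highest-scoring non-target class.
    ordered = sorted(non_target, key=lambda i: scores[i], reverse=True)
    # Zipf's law: weight is inversely proportional to rank.
    weights = {cls: 1.0 / (rank + 1) for rank, cls in enumerate(ordered)}
    total = sum(weights.values())
    return [0.0 if i == target else weights[i] / total for i in range(len(scores))]
```

For example, `zipf_non_target_labels([0.1, 2.0, 0.5, 1.5], target=1)` gives the largest label to class 3, then class 2, then class 0, with the three labels summing to 1.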

arXiv information

Authors	Zhendong Yang,Ailing Zeng,Zhe Li,Tianke Zhang,Chun Yuan,Yu Li
Published	2023-07-17 12:22:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
