Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment

Summary

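Vision Transformer (ViT) is becoming increasingly popular in image processing.
We investigate the effectiveness on ViT of test-time adaptation (TTA), a technique in which a model corrects its own predictions at test time.
First, we benchmark various test-time adaptation approaches on ViT-B16 and ViT-L16.
The results show that TTA is effective on ViT and that the prior convention of carefully selecting modulation parameters is unnecessary when a proper loss function is used.
Based on this observation, we propose a new test-time adaptation method called class-conditional feature alignment (CFA), which minimizes, in an online manner, both the class-conditional distribution differences and the whole-distribution differences of the hidden representation between source and target.
Image classification experiments on common corruptions (CIFAR-10-C, CIFAR-100-C, and ImageNet-C) and domain adaptation (digits datasets and ImageNet-Sketch) show that CFA stably outperforms existing baselines across datasets.
Experiments on ResNet, MLP-Mixer, and several ViT variants (ViT-AugReg, DeiT, and BeiT) also confirm that CFA is model-agnostic.
With a BeiT backbone, CFA achieves a 19.8% top-1 error rate on ImageNet-C, compared with 44.0% for the existing test-time adaptation baseline.
This is a state-of-the-art result among TTA methods that do not require altering the training phase.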
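
The idea of aligning both the whole feature distribution and the per-class feature distributions can be illustrated with a small sketch. This is not the paper's loss: the abstract does not say which statistics are matched, so the sketch assumes that only feature means are compared, that classes are assigned by the model's own pseudo-labels, and that the source statistics were computed beforehand. The names `alignment_loss`, `source_mean`, and `source_class_means` are hypothetical.

```python
from collections import defaultdict


def mean_vector(rows):
    """Column-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(target_feats, pseudo_labels, source_mean, source_class_means):
    """Overall mean gap plus average per-class mean gap (illustrative assumption).

    target_feats:       hidden features of one target (test) batch
    pseudo_labels:      class predicted by the model for each target sample
    source_mean:        precomputed mean feature over all source data
    source_class_means: precomputed mean feature per source class
    """
    # Whole-distribution term: compare the batch mean with the source mean.
    overall = squared_distance(mean_vector(target_feats), source_mean)

    # Class-conditional term: group target features by pseudo-label.
    by_class = defaultdict(list)
    for feat, label in zip(target_feats, pseudo_labels):
        by_class[label].append(feat)

    class_gaps = [
        squared_distance(mean_vector(rows), source_class_means[label])
        for label, rows in by_class.items()
        if label in source_class_means
    ]
    conditional = sum(class_gaps) / len(class_gaps) if class_gaps else 0.0
    return overall + conditional


# Toy source statistics for a two-class problem with 2-D features.
source_mean = [0.0, 0.0]
source_class_means = {0: [-1.0, 0.0], 1: [1.0, 0.0]}

aligned = alignment_loss([[-1.0, 0.0], [1.0, 0.0]], [0, 1],
                         source_mean, source_class_means)
shifted = alignment_loss([[-0.5, 0.0], [1.5, 0.0]], [0, 1],
                         source_mean, source_class_means)
print(aligned, shifted)
```

In an actual online setting, a quantity of this kind would be minimized by gradient steps on the model's parameters for each incoming test batch; the sketch only evaluates the loss value, which is zero for features that match the source statistics and grows as the target features drift.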

Summary (original)

Vision Transformer (ViT) is becoming more popular in image processing. Specifically, we investigate the effectiveness of test-time adaptation (TTA) on ViT, a technique that has emerged to correct its prediction during test-time by itself. First, we benchmark various test-time adaptation approaches on ViT-B16 and ViT-L16. It is shown that the TTA is effective on ViT and the prior-convention (sensibly selecting modulation parameters) is not necessary when using proper loss function. Based on the observation, we propose a new test-time adaptation method called class-conditional feature alignment (CFA), which minimizes both the class-conditional distribution differences and the whole distribution differences of the hidden representation between the source and target in an online manner. Experiments of image classification tasks on common corruption (CIFAR-10-C, CIFAR-100-C, and ImageNet-C) and domain adaptation (digits datasets and ImageNet-Sketch) show that CFA stably outperforms the existing baselines on various datasets. We also verify that CFA is model agnostic by experimenting on ResNet, MLP-Mixer, and several ViT variants (ViT-AugReg, DeiT, and BeiT). Using BeiT backbone, CFA achieves 19.8% top-1 error rate on ImageNet-C, outperforming the existing test-time adaptation baseline 44.0%. This is a state-of-the-art result among TTA methods that do not need to alter training phase.

arXiv information

Authors	Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
Published	2022-06-28 12:14:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
