Fully Attentional Networks with Self-emerging Token Labeling

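Summary

Recent studies show that Vision Transformers (ViTs) are robust in out-of-distribution scenarios.
In particular, the Fully Attentional Network (FAN), a family of ViT backbones, achieves state-of-the-art robustness.
This paper revisits the FAN models and improves their pre-training with a self-emerging token labeling (STL) framework.
The method uses a two-stage training framework.
Specifically, a FAN token labeler (FAN-TL) is first trained to generate semantically meaningful patch token labels; a FAN student model is then trained using both the token labels and the original class label.
With the proposed STL framework, the best model, based on FAN-L-Hybrid (77.3M parameters), achieves 84.8% Top-1 accuracy on ImageNet-1K and 42.1% mCE on ImageNet-C.
It also sets a new state of the art on ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins.
The framework also significantly improves performance on downstream tasks such as semantic segmentation, with up to 1.7% higher robustness than the counterpart model.
Code is available at https://github.com/NVlabs/STL.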

Summary (original)

Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN) – a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method contains a two-stage training framework. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, followed by a FAN student model training stage that uses both the token labels and the original class label. With the proposed STL framework, our best model based on FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy and 42.1% mCE on ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. Code is available at https://github.com/NVlabs/STL.
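
The two-stage idea in the abstract (a token labeler produces per-patch labels, then a student is trained on both those token labels and the image-level class label) can be sketched as a combined loss. The abstract does not specify the loss form, the label format, or any weighting, so everything below is an illustrative assumption and not the authors' implementation: hard argmax token labels, a mean token cross-entropy, and the hypothetical names `token_labels_from_labeler`, `student_loss`, and `token_weight`. The actual method is in the linked repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy of one prediction against a hard integer label."""
    return -math.log(softmax(logits)[target_index])


def token_labels_from_labeler(labeler_token_logits):
    """Stage 1 output (assumed form): one hard label per patch token,
    taken as the argmax of the trained labeler's per-token logits."""
    return [max(range(len(l)), key=l.__getitem__) for l in labeler_token_logits]


def student_loss(class_logits, class_label,
                 student_token_logits, token_labels, token_weight=1.0):
    """Stage 2 objective (assumed form): class-label loss plus a weighted
    mean of per-token losses against the labeler's token labels."""
    cls_loss = cross_entropy(class_logits, class_label)
    tok_loss = sum(
        cross_entropy(l, t) for l, t in zip(student_token_logits, token_labels)
    ) / len(token_labels)
    return cls_loss + token_weight * tok_loss


# Toy example: 2 patch tokens, 3 classes.
labels = token_labels_from_labeler([[0.1, 2.0, -1.0], [3.0, 0.0, 0.0]])
loss = student_loss([0.5, 1.5, 0.0], 1, [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]], labels)
```

Setting `token_weight=0.0` in this sketch reduces the objective to ordinary class-label training, which makes the contribution of the token-level supervision easy to isolate.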

arXiv information

Authors	Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez
Published	2024-01-08 12:14:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
