Structured Initialization for Vision Transformers

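Abstract

Convolutional neural networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets.
In this paper, we propose integrating this inductive bias into ViTs through initialization alone, rather than through an architectural intervention.
The motivation is a ViT that delivers strong CNN-like performance when data is scarce, yet still scales to ViT-like performance as the data grows.
Our approach is motivated by our empirical finding that random impulse filters can achieve performance comparable to learned filters within a CNN.
We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as reusing attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structure.
Empirical results show that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparable performance on large-scale datasets such as ImageNet-1K.
Moreover, the initialization strategy can be easily integrated into various transformer-based architectures, such as Swin Transformer and MLP-Mixer, with consistent performance improvements.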

Abstract (original)

Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparative performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.
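
To make the term "random impulse filter" concrete, here is a minimal NumPy sketch. It assumes an impulse filter is a kernel that is zero everywhere except for a single 1 at a randomly chosen position, so that applying it simply shifts the input. The function names (`random_impulse_filter`, `correlate2d_same`, `shift_with_zeros`) are hypothetical and chosen for illustration; this is not the authors' implementation and does not show how the structure is transferred into ViT attention.

```python
import numpy as np

def random_impulse_filter(k, rng):
    """Return a k x k filter that is all zeros except a single 1 at a random position.

    Assumption: this is what "random impulse filter" means here.
    """
    f = np.zeros((k, k))
    i, j = rng.integers(0, k, size=2)
    f[i, j] = 1.0
    return f

def correlate2d_same(image, kernel):
    """Naive 'same' cross-correlation with zero padding (odd kernel size assumed)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)  # zero padding on every side
    h, w = image.shape
    out = np.zeros((h, w))
    for di in range(k):
        for dj in range(k):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

def shift_with_zeros(image, dy, dx):
    """Return out with out[y, x] = image[y + dy, x + dx], and zeros where that is out of range."""
    h, w = image.shape
    out = np.zeros((h, w))
    out[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)] = \
        image[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)]
    return out

rng = np.random.default_rng(0)
kernel = random_impulse_filter(3, rng)
image = rng.standard_normal((6, 6))

# Applying an impulse filter is the same as shifting the image by the
# impulse's offset from the kernel centre.
i, j = np.argwhere(kernel == 1.0)[0]
filtered = correlate2d_same(image, kernel)
shifted = shift_with_zeros(image, i - 1, j - 1)
print(np.allclose(filtered, shifted))
```

Under this reading, each filter carries no learned weights at all, only a spatial offset, which is the kind of locality structure a CNN imposes by construction.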

arXiv information

Authors	Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey
Published	2025-05-26 13:42:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
