Locality Guidance for Improving Vision Transformers on Tiny Datasets

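Summary

Although the Vision Transformer (VT) architecture is increasingly popular in computer vision, pure VT models perform poorly on tiny datasets.
To address this, the paper proposes locality guidance to improve the performance of VTs on tiny datasets.
The authors first observe that local information, which is essential for understanding images, is hard to learn from limited data because the self-attention mechanism in VTs is highly flexible and intrinsically global.
To help the VT learn local information, locality guidance is realized by having the VT imitate the features of an already trained convolutional neural network (CNN), motivated by the local-to-global hierarchy built into CNNs.
Under the dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is enough to accelerate convergence and substantially improve VT performance.
The approach is therefore simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets.
Extensive experiments show that the method significantly improves VTs trained from scratch on tiny datasets and is compatible with different kinds of VTs and datasets.
For example, it boosts the performance of various VTs on tiny datasets (13.07% for DeiT, 8.98% for T2T, and 7.85% for PVT) and improves the even stronger PVTv2 baseline by 1.86% to 79.30%, showing the potential of VTs on tiny datasets.
The code is available at https://github.com/lkhl/tiny-transformers.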
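
The dual-task idea described in the summary can be illustrated with a minimal sketch. The abstract does not state the exact loss, so everything here is an assumption made for illustration: cross-entropy for the classification task, mean squared error for the feature-imitation task, and a hypothetical weight `beta` that balances the two. The function names are also hypothetical, and the VT and CNN features are assumed to be already aligned to the same length. This is not the authors' implementation; see their repository for that.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw class scores (log-sum-exp form)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def imitation_loss(vt_features, cnn_features):
    """Mean squared error between VT features and the fixed CNN features."""
    if len(vt_features) != len(cnn_features):
        raise ValueError("features must be aligned to the same length")
    squared = ((v - c) ** 2 for v, c in zip(vt_features, cnn_features))
    return sum(squared) / len(vt_features)


def dual_task_loss(logits, label, vt_features, cnn_features, beta=1.0):
    """Classification loss plus beta times the feature-imitation loss.

    beta is a hypothetical balancing weight, not a value from the paper.
    """
    return cross_entropy(logits, label) + beta * imitation_loss(vt_features, cnn_features)


# Toy values for a single image: 3 classes, 4-dimensional features.
logits = [2.0, 0.5, -1.0]
vt_features = [0.2, -0.1, 0.4, 0.0]
cnn_features = [0.0, 0.0, 0.5, 0.1]  # assumed constant: the CNN is already trained
loss = dual_task_loss(logits, 0, vt_features, cnn_features, beta=0.5)
print(f"dual-task loss: {loss:.4f}")
```

In an actual training loop the features would be tensors from intermediate layers, and only the VT would be updated while the CNN output serves as a fixed target; that detail is likewise an assumption of this sketch.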

Summary (original)

While the Vision Transformer (VT) architecture is becoming trendy in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes the locality guidance for improving the performance of VTs on tiny datasets. We first analyze that the local information, which is of great importance for understanding images, is hard to be learned with limited data due to the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate local information, we realize the locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNN. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is adequate to accelerate the convergence and improve the performance of VTs to a large extent. Therefore, our locality guidance approach is very simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method can significantly improve VTs when training from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our proposed method can boost the performance of various VTs on tiny datasets (e.g., 13.07% for DeiT, 8.98% for T2T and 7.85% for PVT), and enhance even stronger baseline PVTv2 by 1.86% to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.

arXiv information

Authors	Kehan Li, Runyi Yu, Zhennan Wang, Li Yuan, Guoli Song, Jie Chen
Published	2022-07-20 16:41:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
