Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Abstract

The Vision Transformer (ViT) leverages the Transformer’s encoder to capture global information by dividing images into patches, and it achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring fine-grained local details. Consequently, ViT lacks inductive bias when trained on image or video datasets. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT when data is limited. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the models capture both local and global information with minimal overhead. Additionally, we introduce two architecture variants: one allows a Depth-Wise Convolution module to be applied to multiple Transformer blocks for parameter savings, and the other incorporates independent parallel Depth-Wise Convolution modules with different kernels to enhance the acquisition of local information. The proposed approach boosts the performance of ViT models on image classification, object detection and instance segmentation by a large margin, especially on small datasets, as evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet for image classification, and COCO for object detection and instance segmentation. The source code can be accessed at https://github.com/ZTX-100/Efficient_ViT_with_DW.
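
To make the shortcut idea concrete, the following is a minimal, dependency-free Python sketch, not the authors' implementation (which is in the linked repository). It assumes the patch tokens are laid out on a height x width grid, that the shortcut output is simply added to the block output, and that the class token is ignored. The names `depthwise_conv`, `block_with_dw_shortcut`, and `global_mean_block` are hypothetical, and `global_mean_block` is only a stand-in for a real Transformer block.

```python
from typing import Callable, List

Grid = List[List[List[float]]]     # [height][width][channels] patch-token grid
Kernels = List[List[List[float]]]  # [channels][k][k], one filter per channel


def depthwise_conv(x: Grid, kernels: Kernels) -> Grid:
    """Depth-wise convolution: channel c is filtered only by kernels[c].

    Stride 1 with zero padding, so the output has the same shape as x.
    Like deep-learning frameworks, this computes a cross-correlation.
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])  # square k x k kernel, k assumed odd
    r = k // 2
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding
                            acc += kernels[ch][di][dj] * x[ii][jj][ch]
                out[i][j][ch] = acc
    return out


def block_with_dw_shortcut(
    x: Grid, block: Callable[[Grid], Grid], kernels: Kernels
) -> Grid:
    """Return block(x) + depthwise_conv(x), added element-wise.

    Assumption: the local (convolution) path bypasses the whole block
    and is merged with the global path by summation.
    """
    main = block(x)
    local = depthwise_conv(x, kernels)
    return [
        [[m + l for m, l in zip(mt, lt)] for mt, lt in zip(mr, lr)]
        for mr, lr in zip(main, local)
    ]


def global_mean_block(x: Grid) -> Grid:
    """Hypothetical stand-in for a Transformer block: every token becomes
    the per-channel mean over all tokens (a crude form of global mixing)."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    mean = [sum(x[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
            for ch in range(c)]
    return [[list(mean) for _ in range(w)] for _ in range(h)]


# Usage: a 3 x 3 grid of tokens with 2 channels.
x = [[[float(i * 3 + j), 1.0] for j in range(3)] for i in range(3)]
box = [[1.0] * 3 for _ in range(3)]                               # sums the 3 x 3 neighborhood
center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]      # passes the token through
out = block_with_dw_shortcut(x, global_mean_block, [box, center])
print(out[1][1])  # [40.0, 2.0]
```

Under the same assumptions, the parallel-kernel variant mentioned in the abstract could be sketched by calling `depthwise_conv` several times with kernels of different sizes and adding all results to the block output. In a real model the kernels would be learned parameters rather than fixed values.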

arXiv information

Authors	Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang
Published	2024-08-02 10:05:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
