Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Summary

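The Vision Transformer (ViT) uses the Transformer encoder to capture global information by splitting an image into patches, and it performs well across a range of computer vision tasks.
However, ViT's self-attention mechanism captures global context from the outset and overlooks the inherent relationships between neighboring pixels in images and videos.
Transformers focus mainly on global information and neglect fine-grained local detail.
As a result, ViT lacks inductive bias when trained on image or video datasets.
By contrast, convolutional neural networks (CNNs) rely on local filters and therefore have a built-in inductive bias, so they converge faster and more efficiently than ViT when data is limited.
This paper introduces a lightweight Depth-Wise Convolution module as a shortcut in ViT models; the shortcut bypasses entire Transformer blocks, so the model captures both local and global information with minimal overhead.
The paper also introduces two architecture variants: one lets the Depth-Wise Convolution modules be applied across multiple Transformer blocks to save parameters, and the other adds independent, parallel Depth-Wise Convolution modules with different kernels to strengthen the capture of local information.
The proposed approach substantially improves ViT performance on image classification, object detection, and instance segmentation, especially on small datasets.
It is evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet for image classification, and on COCO for object detection and instance segmentation.
The source code is available at https://github.com/ZTX-100/Efficient_ViT_with_DW.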
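
To make the shortcut idea concrete, here is a minimal, framework-free sketch of a depth-wise convolution running in parallel with a Transformer block. It is not the authors' implementation. Everything beyond what the summary states is an assumption made for illustration: the helper names (`depthwise_conv2d`, `tokens_to_grid`, `grid_to_tokens`, `block_with_dw_shortcut`), the absence of a class token, zero "same" padding with stride 1, and the choice to add the shortcut output to the block output. The `block_fn` argument stands in for any Transformer block.

```python
"""Sketch: depth-wise convolution as a shortcut around a Transformer block.

Assumptions (not taken from the paper): no class token, zero "same" padding,
stride 1, and the shortcut output is summed with the block output.
"""


def depthwise_conv2d(x, kernels):
    """Filter each channel of x with its own k x k kernel (k odd).

    x has shape [C][H][W] and kernels has shape [C][k][k], both nested lists.
    As in deep-learning frameworks, this is a cross-correlation.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):  # one filter per channel, no channel mixing
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the grid count as zero.
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += kernels[c][di][dj] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def tokens_to_grid(tokens, height, width):
    """Reshape [N][C] patch tokens (row-major, N = H * W) into [C][H][W]."""
    channels = len(tokens[0])
    return [[[tokens[i * width + j][c] for j in range(width)]
             for i in range(height)]
            for c in range(channels)]


def grid_to_tokens(grid):
    """Inverse of tokens_to_grid: [C][H][W] back to [N][C]."""
    channels, height, width = len(grid), len(grid[0]), len(grid[0][0])
    return [[grid[c][i][j] for c in range(channels)]
            for i in range(height) for j in range(width)]


def block_with_dw_shortcut(tokens, block_fn, kernels, height, width):
    """Return block_fn(tokens) + DWConv(tokens), element by element."""
    global_path = block_fn(tokens)  # self-attention path: global context
    local_path = grid_to_tokens(    # shortcut path: local neighborhoods
        depthwise_conv2d(tokens_to_grid(tokens, height, width), kernels))
    return [[g + l for g, l in zip(g_tok, l_tok)]
            for g_tok, l_tok in zip(global_path, local_path)]


def depthwise_params(channels, k):
    """Weights in a depth-wise k x k convolution (bias not counted)."""
    return channels * k * k


def standard_conv_params(channels, k):
    """Weights in a standard k x k convolution with C inputs and C outputs."""
    return channels * channels * k * k
```

The parameter helpers show why such a shortcut can be called lightweight: with an illustrative width of 192 channels and a 3 x 3 kernel, `depthwise_params(192, 3)` gives 1,728 weights, while `standard_conv_params(192, 3)` gives 331,776.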

Summary (original)

The Vision Transformer (ViT) leverages the Transformer’s encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring the fine-grained local details. Consequently, ViT lacks inductive bias during image or video dataset training. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the models capture both local and global information with minimal overhead. Additionally, we introduce two architecture variants, allowing the Depth-Wise Convolution modules to be applied to multiple Transformer blocks for parameter savings, and incorporating independent parallel Depth-Wise Convolution modules with different kernels to enhance the acquisition of local information. The proposed approach significantly boosts the performance of ViT models on image classification, object detection and instance segmentation by a large margin, especially on small datasets, as evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet for image classification, and COCO for object detection and instance segmentation. The source code can be accessed at https://github.com/ZTX-100/Efficient_ViT_with_DW.

arXiv information

Authors	Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang
Published	2024-08-01 04:22:29+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
