Making Vision Transformers Truly Shift-Equivariant

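Summary

Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision tasks.
Although they are inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts of the input image.
To address this, the authors introduce new designs for each ViT module, including tokenization, self-attention, patch merging, and positional encoding.
With the proposed modules, they obtain truly shift-equivariant ViTs, both in theory and in practice, for four well-established models: Swin, SwinV2, MViTv2, and CvT.
Empirically, they test these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.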

Summary (original)

For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
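
The abstract does not say how the proposed modules work, so the following is only a minimal sketch of the underlying problem, not the authors' implementation. It uses a 1D signal with circular shifts (an assumption made for simplicity) to show why fixed-grid patch tokenization is sensitive to small shifts, and how a data-dependent choice of the grid offset can remove that sensitivity. All function names, and the scoring rule used to pick the offset, are hypothetical choices for illustration.

```python
# Illustrative sketch only: toy 1D "tokenization" under circular shifts.
# Function names and the offset-scoring rule are assumptions, not the paper's API.

def circular_shift(x, s):
    """Shift x to the right by s positions, wrapping around."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]

def tokenize(x, patch, offset=0):
    """Split x into non-overlapping patches, starting the grid at `offset`."""
    n = len(x)
    return [
        tuple(x[(offset + k * patch + j) % n] for j in range(patch))
        for k in range(n // patch)
    ]

def adaptive_offset(x, patch):
    """Pick the grid offset whose patches score highest.

    The score (squared first entry of each patch) is an arbitrary,
    hypothetical choice; it only needs to move with the signal when
    the signal is shifted. Ties would break the guarantee.
    """
    def score(offset):
        return sum(p[0] ** 2 for p in tokenize(x, patch, offset))
    return max(range(patch), key=score)

def adaptive_tokenize(x, patch):
    """Tokenize on the data-dependent grid chosen by adaptive_offset."""
    return tokenize(x, patch, adaptive_offset(x, patch))

x = [3, 1, 4, 1, 5, 9, 2, 6]
patch = 4
shifted = circular_shift(x, 1)

# Fixed grid: a one-sample shift changes the content of every token.
fixed_same = sorted(tokenize(x, patch)) == sorted(tokenize(shifted, patch))

# Adaptive grid: the same tokens come out, up to their order.
adaptive_same = sorted(adaptive_tokenize(x, patch)) == sorted(
    adaptive_tokenize(shifted, patch)
)

print("fixed grid keeps tokens under a 1-sample shift:", fixed_same)
print("adaptive grid keeps tokens under a 1-sample shift:", adaptive_same)
```

In this toy setting, a shift by a multiple of the patch size only reorders the fixed-grid tokens, whereas any other shift changes their content. Letting the grid follow the signal is one way to get the same set of tokens for every shift.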

arXiv information

Authors	Renan A. Rojas-Gomez, Teck-Yian Lim, Minh N. Do, Raymond A. Yeh
Published	2023-05-25 17:59:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
