FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Summary

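The recent fusion of transformer and convolutional designs has steadily improved model accuracy and efficiency.
In this work, we introduce FastViT, a hybrid vision transformer architecture that achieves a state-of-the-art latency-accuracy trade-off.
To this end, we introduce RepMixer, a new token mixing operator and a building block of FastViT, which uses structural reparameterization to remove skip connections in the network and thereby lower memory access cost.
We further apply train-time overparameterization and large-kernel convolutions to boost accuracy, and show empirically that these choices have minimal effect on latency.
At the same accuracy on the ImageNet dataset, our model is 3.5x faster than CMT (a recent state-of-the-art hybrid transformer architecture), 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device.
At similar latency, our model obtains 4.2% better top-1 accuracy on ImageNet than MobileOne.
Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression), with significant latency improvements on both a mobile device and a desktop GPU.
Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.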
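
The abstract does not spell out how structural reparameterization removes a skip connection, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy single-channel 1-D convolution with an odd-length kernel and zero padding, and it leaves out normalization layers and the depthwise 2-D layout a real network would use. The helper names `conv1d_same` and `fold_skip_into_kernel` are hypothetical. Because an identity mapping is itself a convolution whose kernel is 1 at the center tap and 0 elsewhere, a train-time branch `x + conv(x, k)` can be replaced at inference by one convolution with a folded kernel.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd-length kernel (illustrative helper, not a library function).
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def fold_skip_into_kernel(kernel):
    """Return a kernel k' such that conv(x, k') == x + conv(x, k)."""
    folded = list(kernel)
    # The identity map is a kernel with 1 at the center tap, so adding the
    # skip connection amounts to adding 1 to the center weight.
    folded[len(kernel) // 2] += 1.0
    return folded


x = [1.0, 2.0, -1.0, 0.5, 3.0]
kernel = [0.25, -0.5, 0.75]  # arbitrary example weights, odd length

# Train-time form: a convolution branch plus an explicit skip connection.
train_time = [xi + yi for xi, yi in zip(x, conv1d_same(x, kernel))]

# Inference-time form: a single convolution, with no skip connection to read.
inference = conv1d_same(x, fold_skip_into_kernel(kernel))

max_diff = max(abs(a - b) for a, b in zip(train_time, inference))
print(max_diff)
```

The two forms compute the same output, but the folded form reads the input once instead of keeping it around for the addition, which is the kind of memory access saving the summary refers to.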

Summary (original)

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that – our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks — image classification, detection, segmentation and 3D mesh regression with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.

arXiv information

Authors	Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan
Published	2023-03-24 17:58:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
