MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

Abstract

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to build lightweight models for mobile vision tasks. Although the MobileViTv1-block achieves competitive state-of-the-art results, the fusion block inside it is hard to scale and has a complex learning task. We propose simple, effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling problem and simplifies the learning task. The MobileViTv3-XXS, XS, and S models built from the proposed block outperform MobileViTv1 on the ImageNet-1k, ADE20K, COCO, and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to outperform MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75, and 1.0 models, which are more accurate than MobileViTv2 on the ImageNet-1k, ADE20K, COCO, and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0%, respectively. On segmentation, MobileViTv3-1.0 improves mIOU over MobileViTv2-1.0 by 2.07% on ADE20K and by 1.1% on PascalVOC2012. Our code and trained models are available at https://github.com/micronDLA/MobileViTv3.

Abstract (original)

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside MobileViTv1-block, creates scaling challenges and has a complex learning task. We propose changes to the fusion block that are simple and effective to create MobileViTv3-block, which addresses the scaling and simplifies the learning task. Our proposed MobileViTv3-block used to create MobileViTv3-XXS, XS and S models outperform MobileViTv1 on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpasses MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. Recently published MobileViTv2 architecture removes fusion block and uses linear complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create MobileViTv3-0.5, 0.75 and 1.0 models. These new models give better accuracy numbers on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets as compared to MobileViTv2. MobileViTv3-0.5 and MobileViTv3-0.75 outperforms MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on ImageNet-1K dataset. For segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU compared to MobileViTv2-1.0 on ADE20K dataset and PascalVOC2012 dataset respectively. Our code and the trained models are available at: https://github.com/micronDLA/MobileViTv3

arXiv information

Authors	Shakti N. Wadekar, Abhishek Chaurasia
Published	2022-10-06 14:19:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
