EfficientFormer: Vision Transformers at MobileNet Speed

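Abstract

Vision Transformers (ViT) have made rapid progress on computer vision tasks and achieve promising results on a range of benchmarks. However, because of their very large parameter counts and design choices such as the attention mechanism, ViT-based models are generally many times slower than lightweight convolutional networks. Deploying ViT in real-time applications is therefore difficult, especially on resource-constrained hardware such as mobile devices. Recent work has tried to reduce the computational complexity of ViT through network architecture search or hybrid designs with MobileNet blocks, but inference speed remains unsatisfactory. This raises an important question: can transformers run as fast as MobileNet while still achieving high performance? To answer it, we first revisit the network architectures and operators used in ViT-based models and identify the inefficient designs. Next, we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we apply latency-driven slimming to obtain a series of final models called EfficientFormer. Extensive experiments show that EfficientFormer is superior in both performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves $79.2\%$ top-1 accuracy on ImageNet-1K with an inference latency of only $1.6$ ms on iPhone 12 (compiled with CoreML), which is as fast as MobileNetV2$\times 1.4$ ($1.6$ ms, $74.7\%$ top-1), and our largest model, EfficientFormer-L7, reaches $83.3\%$ accuracy with only $7.0$ ms latency. Our work shows that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.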

Abstract (original)

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves $79.2\%$ top-1 accuracy on ImageNet-1K with only $1.6$ ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2$\times 1.4$ ($1.6$ ms, $74.7\%$ top-1), and our largest model, EfficientFormer-L7, obtains $83.3\%$ accuracy with only $7.0$ ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
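
The accuracy and latency figures quoted in the abstract can be put side by side with a little arithmetic. The Python sketch below only restates those three reported data points; the names `reported` and `compare` are illustrative and do not come from the paper, and no new measurements are implied.

```python
# Figures quoted in the abstract: (ImageNet-1K top-1 accuracy in %, latency in ms on iPhone 12).
reported = {
    "EfficientFormer-L1": (79.2, 1.6),
    "MobileNetV2x1.4": (74.7, 1.6),
    "EfficientFormer-L7": (83.3, 7.0),
}


def compare(model, baseline):
    """Return (accuracy gain in percentage points, latency ratio) of model over baseline."""
    acc_m, lat_m = reported[model]
    acc_b, lat_b = reported[baseline]
    # Round to one decimal place, matching the precision of the quoted figures.
    return round(acc_m - acc_b, 1), round(lat_m / lat_b, 1)


l1_vs_mobilenet = compare("EfficientFormer-L1", "MobileNetV2x1.4")
l7_vs_l1 = compare("EfficientFormer-L7", "EfficientFormer-L1")
print("L1 vs MobileNetV2x1.4:", l1_vs_mobilenet)
print("L7 vs L1:", l7_vs_l1)
```

Read this way, EfficientFormer-L1 gains 4.5 points of top-1 accuracy over MobileNetV2$\times 1.4$ at the same reported latency, while EfficientFormer-L7 adds a further 4.1 points at roughly 4.4 times the latency of L1.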

arXiv information

Authors	Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren
Published	2022-07-05 14:50:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
