CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

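Summary

Vision Transformers (ViTs) mark a major advance in neural networks, thanks to the strong global context modeling of their token mixers.
However, despite considerable effort in prior work, pairwise token affinity and complex matrix operations limit their deployment in resource-constrained settings and real-time applications such as mobile devices.
This paper introduces CAS-ViT (Convolutional Additive Self-attention Vision Transformers), which aims to balance efficiency and performance in mobile applications.
The authors first argue that a token mixer's ability to capture global contextual information hinges on multiple information interactions, such as those across the spatial and channel domains.
Following this paradigm, they construct a novel additive similarity function and present an efficient implementation called the Convolutional Additive Token Mixer (CATM).
This simplification significantly reduces computational overhead.
CAS-ViT is evaluated across a range of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation.
Experiments on GPUs, ONNX, and iPhones show that CAS-ViT achieves competitive performance compared with other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications.
The code and models are available at https://github.com/Tianfang-Zhang/CAS-ViT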
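
The summary does not spell out the form of the additive similarity function, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that the query and key are each passed through a context mapping (a spatial gate followed by a channel gate), that the two results are added rather than multiplied pairwise, and that the sum modulates the value elementwise. The sigmoid gates, the depthwise 3x3 kernels, and every function and parameter name (`context_map`, `additive_token_mixer`, `init_params`, `q_spatial`, and so on) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depthwise_conv3x3(x, kernel):
    """Depthwise 3x3 convolution with zero padding and stride 1.

    x: (H, W, C) feature map, kernel: (3, 3, C), one filter per channel.
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w, :] * kernel[i, j, :]
    return out


def context_map(x, spatial_kernel, channel_weight):
    """Assumed context mapping: a spatial gate followed by a channel gate."""
    # Spatial interaction: each position is gated by its 3x3 neighbourhood.
    x = x * sigmoid(depthwise_conv3x3(x, spatial_kernel))
    # Channel interaction: global average pool, linear map, sigmoid gate.
    pooled = x.mean(axis=(0, 1))                    # (C,)
    return x * sigmoid(pooled @ channel_weight)     # broadcasts over H and W


def additive_token_mixer(x, p):
    """Hypothetical additive mixer; cost is linear in the number of tokens."""
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]  # per-token projections
    # Additive similarity: sum of two context maps, never an N x N matrix.
    sim = (context_map(q, p["q_spatial"], p["q_channel"])
           + context_map(k, p["k_spatial"], p["k_channel"]))
    mixed = depthwise_conv3x3(sim, p["mix_kernel"]) * v
    return mixed @ p["wo"]


def init_params(channels, rng):
    """Random parameters for the sketch (all names are illustrative)."""
    def dense():
        return rng.standard_normal((channels, channels)) / np.sqrt(channels)

    def dw():
        return rng.standard_normal((3, 3, channels)) / 3.0

    return {
        "wq": dense(), "wk": dense(), "wv": dense(), "wo": dense(),
        "q_spatial": dw(), "k_spatial": dw(), "mix_kernel": dw(),
        "q_channel": dense(), "k_channel": dense(),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((14, 14, 32))    # 196 tokens, 32 channels
    y = additive_token_mixer(x, init_params(32, rng))
    print(y.shape)
```

The output has the same shape as the input feature map. Under these assumptions, every step is either a per-token projection or a fixed-size convolution, so the work grows linearly with the number of tokens, whereas softmax self-attention must form an affinity matrix over all token pairs.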

Abstract (original)

Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer’s powerful global context capability. However, the pairwise token affinity and complex matrix operations limit its deployment on resource-constrained scenarios and real-time applications, such as mobile devices, although considerable efforts have been made in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as spatial and channel domains. Subsequently, we construct a novel additive similarity function following this paradigm and present an efficient implementation named Convolutional Additive Token Mixer (CATM). This simplification leads to a significant reduction in computational overhead. We evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our experiments, conducted on GPUs, ONNX, and iPhones, demonstrate that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications. Our code and model are available at: \url{https://github.com/Tianfang-Zhang/CAS-ViT}

arXiv information

Authors	Tianfang Zhang, Lei Li, Yang Zhou, Wentao Liu, Chen Qian, Xiangyang Ji
Published	2024-08-07 11:33:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
