FFT-based Dynamic Token Mixer for Vision

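Abstract

Models equipped with multi-head self-attention (MHSA) have achieved notable performance in computer vision. However, their computational cost grows with the square of the number of pixels in the input feature map, which makes them slow, especially on high-resolution images. New types of token mixer have been proposed as alternatives to MHSA to get around this problem. An FFT-based token mixer resembles MHSA in that it operates globally, but at lower computational cost. Despite this attractive property, FFT-based token mixers have not been carefully examined for compatibility with the rapidly evolving MetaFormer architecture. To close this gap, we propose a new token mixer called the dynamic filter, together with DFFormer and CDFFormer, image recognition models built on dynamic filters. CDFFormer achieves a Top-1 accuracy of 85.0%, close to that of hybrid architectures combining convolution and MHSA. Wide-ranging experiments and analyses, including object detection and semantic segmentation, show that the models are competitive with state-of-the-art architectures. On high-resolution image recognition, their throughput and memory efficiency are not much different from those of ConvFormer and far better than those of CAFormer. These results indicate that the dynamic filter is a token-mixer option that deserves serious consideration. The code is available at https://github.com/okojoalg/dfformer.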
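To make the idea of an FFT-based token mixer concrete, here is a minimal NumPy sketch. It is not the authors' implementation (that lives in the repository linked above). The abstract does not say how the dynamic filter is computed, so `dynamic_filter`, its softmax weighting over `K` basis filters, and the names `basis` and `w_proj` are illustrative assumptions. The part that is standard is `fft_token_mixer`: a 2D FFT over the spatial axes, an element-wise product with a filter, and an inverse FFT. That is a global circular convolution costing O(N log N) for N = H × W tokens, compared with the O(N²) pairwise interactions of MHSA.

```python
import numpy as np


def fft_token_mixer(x, filt):
    """Mix tokens globally by filtering in the 2D frequency domain.

    x:    real array of shape (H, W, C), a feature map with C channels.
    filt: complex array of shape (H, W // 2 + 1, C), one filter per channel.
    """
    h, w, _ = x.shape
    # Real 2D FFT over the spatial axes; cost is O(N log N) with N = H * W.
    freq = np.fft.rfft2(x, axes=(0, 1))
    # An element-wise product in frequency space is a global circular
    # convolution in pixel space, so every output token sees every input token.
    freq = freq * filt
    return np.fft.irfft2(freq, s=(h, w), axes=(0, 1))


def dynamic_filter(x, basis, w_proj):
    """Hypothetical input-dependent filter (an assumption, not from the abstract).

    Builds the filter as a softmax-weighted sum of K basis filters, with the
    weights computed from the globally averaged input features.

    basis:  complex array of shape (K, H, W // 2 + 1, C).
    w_proj: real array of shape (C, K) mapping pooled features to K logits.
    """
    pooled = x.mean(axis=(0, 1))                  # (C,) average over all tokens
    logits = pooled @ w_proj                      # (K,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax over the K filters
    return np.tensordot(weights, basis, axes=1)   # (H, W // 2 + 1, C)


# Usage with random placeholder data (shapes chosen only for illustration).
rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 4, 3
x = rng.standard_normal((H, W, C))
basis = (rng.standard_normal((K, H, W // 2 + 1, C))
         + 1j * rng.standard_normal((K, H, W // 2 + 1, C)))
w_proj = rng.standard_normal((C, K))
y = fft_token_mixer(x, dynamic_filter(x, basis, w_proj))  # same shape as x
```

With a filter of all ones the mixer returns its input unchanged, which is a convenient sanity check. A static filter (one fixed `filt` for all inputs) would apply the same mixing to every image; making the filter depend on the input is what the word "dynamic" is taken to mean in this sketch.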

Abstract (original)

Multi-head-self-attention (MHSA)-equipped models have achieved notable performance in computer vision. Their computational complexity is proportional to quadratic numbers of pixels in input feature maps, resulting in slow processing, especially when dealing with high-resolution images. New types of token-mixer are proposed as an alternative to MHSA to circumvent this problem: an FFT-based token-mixer, similar to MHSA in global operation but with lower computational complexity. However, despite its attractive properties, the FFT-based token-mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token-mixer called dynamic filter and DFFormer and CDFFormer, image recognition models using dynamic filters to close the gaps above. CDFFormer achieved a Top-1 accuracy of 85.0%, close to the hybrid architecture with convolution and MHSA. Other wide-ranging experiments and analysis, including object detection and semantic segmentation, demonstrate that they are competitive with state-of-the-art architectures; Their throughput and memory efficiency when dealing with high-resolution image recognition is convolution and MHSA, not much different from ConvFormer, and far superior to CAFormer. Our results indicate that the dynamic filter is one of the token-mixer options that should be seriously considered. The code is available at https://github.com/okojoalg/dfformer

arXiv information

Authors	Yuki Tatsunami, Masato Taki
Published	2023-03-07 14:38:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
