Strip-MLP: Efficient Token Interaction for Vision MLP

Abstract

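Token interaction is one of the core modules of MLP-based models; it exchanges and aggregates information between different spatial locations.
However, the power of token interaction along the spatial dimension depends heavily on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size.
To address this issue, we present a novel method called \textbf{Strip-MLP} that enriches the token interaction power in three ways.
First, we introduce a new MLP paradigm called the Strip MLP layer, which lets tokens interact with other tokens in a cross-strip manner, so that the tokens in a row (or column) contribute to information aggregation in adjacent but different strips of rows (or columns).
Second, a \textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size.
This module allows tokens to interact more effectively both within patches and across patches, independently of the spatial size of the features.
Finally, based on the Strip MLP layer, we propose a novel \textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost the token interaction power in local regions.
Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet.
In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on CIFAR-100.
The source code is available at~\href{https://github.com/Med-Process/Strip_MLP}{https://github.com/Med-Process/Strip\_MLP}.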

Abstract (original)

Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model’s expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called \textbf{Strip-MLP} to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that allows the token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a \textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in the manners of within-patch and cross-patch, which is independent of the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel \textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on CIFAR-100. The source codes will be available at~\href{https://github.com/Med-Process/Strip_MLP}{https://github.com/Med-Process/Strip\_MLP}.
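
The abstract's two central ideas, that spatial token mixing is tied to feature-map resolution and that tokens in adjacent strips can contribute to each other's aggregation, can be illustrated with a small sketch. This is not the authors' Strip MLP layer: the names `mix_row`, `strip_mix`, `mixing_params`, and the `neighbours` parameter are hypothetical, the feature map is single-channel, and averaging over adjacent rows is only an assumed stand-in for cross-strip interaction.

```python
def mix_row(row, weight):
    """Mix the tokens of one row: out[j] = sum_k weight[j][k] * row[k]."""
    return [sum(w * x for w, x in zip(w_j, row)) for w_j in weight]


def strip_mix(feature, weight, neighbours=0):
    """Mix tokens along the width of an H x W single-channel feature map.

    neighbours=0: each row is mixed on its own (plain row-wise token mixing).
    neighbours=1: the mixed rows i-1, i, i+1 are averaged, so tokens in
    adjacent strips contribute to row i. This averaging is an illustrative
    assumption, not the layer described in the paper.
    """
    height = len(feature)
    mixed = [mix_row(row, weight) for row in feature]
    out = []
    for i in range(height):
        lo, hi = max(0, i - neighbours), min(height, i + neighbours + 1)
        strip = mixed[lo:hi]
        # Average column-wise over the rows that fall inside the strip.
        out.append([sum(col) / len(strip) for col in zip(*strip)])
    return out


def mixing_params(width):
    """Number of weights in a W x W token-mixing matrix."""
    return width * width


feature = [[1, 0], [0, 1], [2, 2]]   # toy 3 x 2 feature map
weight = [[1, 1], [0, 1]]            # toy 2 x 2 mixing matrix

row_only = strip_mix(feature, weight)                   # rows stay independent
cross_strip = strip_mix(feature, weight, neighbours=1)  # rows share information
```

In this toy setup the mixing matrix shrinks from `mixing_params(56) == 3136` weights to `mixing_params(7) == 49` as the width drops from 56 to 7, which is one way to read the claim that interaction power depends on spatial resolution. With `neighbours=1` the middle row of `cross_strip` depends on all three input rows, whereas in `row_only` it depends only on its own tokens.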

arXiv information

Authors	Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang
Published	2023-07-21 09:40:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
