VMamba: Visual State Space Model

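Summary

Convolutional neural networks (CNNs) and vision transformers (ViTs) are the two most popular foundation models for visual representation learning.
CNNs scale remarkably well, with complexity that is linear in image resolution.
ViTs surpass them in fitting capability, even though their complexity is quadratic in image resolution.
A closer inspection shows that ViTs achieve superior visual modeling performance by incorporating global receptive fields and dynamic weights.
This observation motivates us to propose a new architecture that inherits these components while improving computational efficiency.
To this end, drawing inspiration from the recently introduced state space model, we propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields.
To address the resulting direction-sensitivity problem, we introduce the Cross-Scan Module (CSM), which traverses the spatial domain and converts any non-causal visual image into ordered patch sequences.
Extensive experiments show that VMamba not only performs promisingly across a range of visual perception tasks but also shows increasingly pronounced advantages over established benchmarks as image resolution increases.
The source code is available at https://github.com/MzeroMiko/VMamba.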
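
The idea of traversing the spatial domain to turn a 2D patch grid into ordered sequences can be illustrated with a small sketch. This is not the code from the VMamba repository: the function names `cross_scan` and `cross_merge` are hypothetical, and the choice of four traversal orders (row-major, column-major, and the reverse of each) is an assumption, since the summary does not state which orders the CSM uses.

```python
import numpy as np


def cross_scan(patches: np.ndarray) -> np.ndarray:
    """Flatten an (H, W, C) patch grid into four (H*W, C) sequences.

    Assumed traversal orders (illustrative only):
      0: row-major, 1: column-major, 2: reversed row-major, 3: reversed column-major.
    """
    h, w, c = patches.shape
    row_major = patches.reshape(h * w, c)
    col_major = patches.transpose(1, 0, 2).reshape(h * w, c)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def cross_merge(sequences: np.ndarray, h: int, w: int) -> np.ndarray:
    """Map the four sequences back onto the (H, W, C) grid and sum them."""
    c = sequences.shape[-1]
    # Undo the reversals so that both halves share the same ordering.
    row = sequences[0] + sequences[2][::-1]
    col = sequences[1] + sequences[3][::-1]
    # Convert the column-major ordering back to row-major before adding.
    col_as_row = col.reshape(w, h, c).transpose(1, 0, 2).reshape(h * w, c)
    return (row + col_as_row).reshape(h, w, c)


if __name__ == "__main__":
    grid = np.arange(6).reshape(2, 3, 1)  # toy 2x3 grid with one channel
    seqs = cross_scan(grid)
    print(seqs[..., 0])                   # one row per traversal order
    print(cross_merge(seqs, 2, 3)[..., 0])
```

In this sketch each patch appears once in every sequence, so a causal sequence model run along each order would let every patch receive context from every other patch in at least one direction. Merging by summation is likewise only one plausible choice.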

Summary (original)

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as the two most popular foundation models for visual representation learning. While CNNs exhibit remarkable scalability with linear complexity w.r.t. image resolution, ViTs surpass them in fitting capabilities despite contending with quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance through the incorporation of global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while enhancing computational efficiency. To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields. To address the encountered direction-sensitive issue, we introduce the Cross-Scan Module (CSM) to traverse the spatial domain and convert any non-causal visual image into order patch sequences. Extensive experimental results substantiate that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established benchmarks as the image resolution increases. Source code has been available at https://github.com/MzeroMiko/VMamba.
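
The contrast between linear and quadratic complexity with respect to image resolution can be made concrete with a rough scaling sketch. The helper names, the patch size of 4, and the 224-pixel baseline are assumptions chosen for illustration. The numbers are relative growth factors in the number of patch tokens, not measured FLOPs for VMamba or any ViT.

```python
def token_count(height: int, width: int, patch: int = 4) -> int:
    """Number of non-overlapping patch tokens (patch size is an assumed value)."""
    return (height // patch) * (width // patch)


def relative_cost(height: int, width: int, base: int = 224, patch: int = 4) -> dict:
    """Growth of linear and quadratic terms relative to a base x base input."""
    n = token_count(height, width, patch)
    n0 = token_count(base, base, patch)
    ratio = n / n0
    return {"tokens": n, "linear": ratio, "quadratic": ratio ** 2}


if __name__ == "__main__":
    for side in (224, 448, 896):
        print(side, relative_cost(side, side))
```

Under these assumptions, doubling the side length multiplies the token count by 4, so a term that is linear in the number of tokens grows 4 times while a term that is quadratic in it grows 16 times.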

arXiv information

Authors	Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
Published	2024-01-18 17:55:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
