Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

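Abstract

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their use in high-resolution image processing and long-context analysis.
This paper introduces Vision-RWKV (VRWKV), a model that adapts the RWKV model used in NLP to vision tasks, with the modifications those tasks require.
Like the Vision Transformer (ViT), our model is designed to handle sparse inputs efficiently and to show robust global processing capability, while scaling up effectively to both large parameter counts and extensive datasets.
Its distinctive advantage is reduced spatial aggregation complexity, which lets it process high-resolution images seamlessly and removes the need for windowing operations.
Our evaluations show that VRWKV surpasses ViT in image classification, and that it is significantly faster and uses less memory on high-resolution inputs.
On dense prediction tasks, it outperforms window-based models while maintaining comparable speed.
These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks.
Code is released at https://github.com/OpenGVLab/Vision-RWKV.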
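
As a rough illustration of why spatial aggregation complexity matters at high resolution, the sketch below counts tokens for a square image and compares an all-pairs aggregation cost with a linear one. It is not taken from the paper or its repository: the 16x16 patch size, the linear cost model, and the function names `num_tokens`, `pairwise_cost`, and `linear_cost` are all assumptions made for illustration. The abstract itself says only that the complexity is "reduced".

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patches (tokens) in an image.

    The 16-pixel patch size is an assumed, ViT-style default.
    """
    return (height // patch) * (width // patch)


def pairwise_cost(n: int) -> int:
    """Unit cost of an all-pairs aggregation: every token attends to every token."""
    return n * n


def linear_cost(n: int) -> int:
    """Unit cost of a hypothetical aggregation that scales linearly with tokens."""
    return n


if __name__ == "__main__":
    for side in (224, 448, 896):
        n = num_tokens(side, side)
        ratio = pairwise_cost(n) // linear_cost(n)
        print(f"{side}x{side}: {n} tokens, pairwise/linear cost ratio = {ratio}")
```

Under these assumptions, doubling the image side quadruples the token count, so the all-pairs cost grows sixteenfold while the linear cost grows fourfold. That gap is the usual motivation for windowing in attention-based models at high resolution.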

Abstract (original)

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage lies in its reduced spatial aggregation complexity, which renders it exceptionally adept at processing high-resolution images seamlessly, eliminating the necessity for windowing operations. Our evaluations demonstrate that VRWKV surpasses ViT’s performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models, maintaining comparable speeds. These results highlight VRWKV’s potential as a more efficient alternative for visual perception tasks. Code is released at \url{https://github.com/OpenGVLab/Vision-RWKV}.

arXiv information

Authors	Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang
Published	2024-03-07 15:43:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
