Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

Summary

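Transformer-based segmentation methods struggle to run inference efficiently on high-resolution images.
Recently, several linear attention architectures, such as Mamba and RWKV, have attracted considerable attention because they can process long sequences efficiently.
In this work, we focus on designing an efficient segment-anything model by exploring these architectures.
Specifically, we design a mixed backbone that combines convolution and RWKV operations and achieves the best results in both accuracy and efficiency.
In addition, we design an efficient decoder that uses multiscale tokens to obtain high-quality masks.
We call our method RWKV-SAM, a simple, effective, and fast baseline for SAM-like models.
Moreover, we build a benchmark containing various high-quality segmentation datasets and use it to jointly train a single efficient, high-quality segmentation model.
On this benchmark, our RWKV-SAM outperforms transformers and other linear attention models in both efficiency and segmentation quality.
For example, compared with a transformer model of the same scale, RWKV-SAM achieves more than a 2x speedup and better segmentation performance on various datasets.
RWKV-SAM also outperforms recent vision Mamba models, with better classification and semantic segmentation results.
Code and models will be made publicly available.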
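
To make the efficiency argument concrete, the following is a rough back-of-the-envelope sketch of why sequence length matters at high resolution. It is not taken from the paper: the patch size of 16 and the two cost functions are illustrative assumptions, and constant factors and the hidden dimension are ignored. It only contrasts quadratic growth (every token interacting with every other token, as in full self-attention) with linear growth (one state update per token, as in a linear-time scan).

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens (patch size is an assumption)."""
    return (height // patch) * (width // patch)


def pairwise_cost(n):
    """Token-pair interactions in full self-attention: grows as n^2."""
    return n * n


def recurrent_cost(n):
    """Per-token state updates in a linear-time scan: grows as n."""
    return n


if __name__ == "__main__":
    for side in (512, 1024, 2048):
        n = num_tokens(side, side)
        ratio = pairwise_cost(n) // recurrent_cost(n)
        print(f"{side}x{side}: {n} tokens, quadratic/linear ratio = {ratio}")
```

Under these assumptions, doubling the image side quadruples the token count, so the gap between the two cost models itself grows with resolution. Measured speedups, such as the 2x figure quoted above, depend on the actual implementation and hardware.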

Abstract (original)

Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operation, which achieves the best for both accuracy and efficiency. In addition, we design an efficient decoder to utilize the multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with the same-scale transformer model, RWKV-SAM achieves more than 2x speedup and can achieve better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.

arXiv information

Authors	Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, Chen Change Loy
Published	2024-06-27 17:49:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
