Multi-Scale Representations by Varying Window Attention for Semantic Segmentation

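Summary

Multi-scale learning is central to semantic segmentation.
We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: scale inadequacy and field inactivation.
To address these issues, we present a novel multi-scale learner, varying window attention (VWA).
VWA builds on local window attention (LWA) and disentangles it into a query window and a context window, so that the scale of the context can vary and the query can learn representations at multiple scales.
However, enlarging the context to large-scale windows (increasing the ratio R) can significantly increase the memory footprint and computation cost (R^2 times that of LWA).
We propose a simple but professional re-scaling strategy that reduces the extra induced cost to zero without compromising performance.
As a result, VWA overcomes the receptive limitation of the local window at the same cost as LWA.
Furthermore, building on VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation.
VWFormer is as efficient as the most compute-friendly MSDs, such as FPN and the MLP decoder, yet performs much better than any of them.
For example, using nearly half of UPerNet's computation, VWFormer outperforms it by 1.0% to 2.5% mIoU on ADE20K.
With little extra overhead (about 10G FLOPs), Mask2Former equipped with VWFormer improves by 1.0% to 1.3%.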
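
The R^2 factor quoted in the abstract can be checked with simple counting. The sketch below is illustrative only: the function names, the feature-map size, and the choice to count only the two matrix products of attention (ignoring projections and the cost of the re-scaling step itself) are assumptions, not details taken from the paper.

```python
def attention_macs(num_queries, num_context, channels):
    """Multiply-accumulates for one window: Q.K^T plus the weighted sum over V."""
    return 2 * num_queries * num_context * channels


def window_costs(height, width, window, ratio, channels):
    """Return (LWA, naive VWA, re-scaled VWA) attention cost for one feature map.

    Assumes the map is tiled by non-overlapping window x window query windows
    and that each context window spans (ratio * window) x (ratio * window) tokens.
    """
    assert height % window == 0 and width % window == 0
    num_windows = (height // window) * (width // window)
    queries = window * window

    # LWA: the context window is the query window itself.
    lwa = num_windows * attention_macs(queries, queries, channels)

    # Naive VWA: every query attends to all tokens of the enlarged context.
    enlarged = (ratio * window) ** 2
    vwa_naive = num_windows * attention_macs(queries, enlarged, channels)

    # Re-scaled VWA (assumed form): the context is shrunk back to
    # window x window tokens before attention, so the count matches LWA.
    vwa_rescaled = num_windows * attention_macs(queries, queries, channels)

    return lwa, vwa_naive, vwa_rescaled


if __name__ == "__main__":
    lwa, naive, rescaled = window_costs(64, 64, window=8, ratio=2, channels=32)
    print(naive // lwa, rescaled // lwa)
```

Under these assumptions the naive variant costs exactly R^2 times the LWA count, while the re-scaled variant matches it, which is the relationship the abstract describes.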

Summary (original)

Multi-scale learning is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: scale inadequacy and field inactivation. A novel multi-scale learner, varying window attention (VWA), is presented to address these issues. VWA leverages the local window attention (LWA) and disentangles LWA into the query window and context window, allowing the context’s scale to vary for the query to learn representations at multiple scales. However, varying the context to large-scale windows (enlarging ratio R) can significantly increase the memory footprint and computation cost (R^2 times larger than LWA). We propose a simple but professional re-scaling strategy to zero the extra induced cost without compromising performance. Consequently, VWA uses the same cost as LWA to overcome the receptive limitation of the local window. Furthermore, depending on VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation. VWFormer achieves efficiency competitive with the most compute-friendly MSDs, like FPN and MLP decoder, but performs much better than any MSDs. For instance, using nearly half of UPerNet’s computation, VWFormer outperforms it by 1.0%-2.5% mIoU on ADE20K. With little extra overhead, ~10G FLOPs, Mask2Former armed with VWFormer improves by 1.0%-1.3%.
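
To make the split between query window and context window concrete, here is a toy one-dimensional analogue in plain Python. It is not the authors' implementation: the scalar features, the absence of learned projections, the zero padding at the borders, the centring of the context on the query window, and the use of average pooling as the re-scaling step are all assumptions chosen to keep the sketch short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_window(seq, start, window, ratio):
    """Tokens of the ratio*window span centred on seq[start:start+window].

    Positions outside the sequence are filled with 0.0 (assumed padding).
    """
    pad = (ratio - 1) * window // 2
    lo = start - pad
    return [seq[i] if 0 <= i < len(seq) else 0.0
            for i in range(lo, lo + ratio * window)]


def avg_pool(tokens, ratio):
    """Average each group of `ratio` neighbours (assumed stand-in for re-scaling)."""
    return [sum(tokens[i:i + ratio]) / ratio
            for i in range(0, len(tokens), ratio)]


def varying_window_attention_1d(seq, window, ratio):
    """Each query window attends to a re-scaled context `ratio` times larger."""
    assert len(seq) % window == 0
    out = []
    for start in range(0, len(seq), window):
        queries = seq[start:start + window]
        # Enlarged context, pooled back to `window` tokens so the
        # attention matrix stays window x window, as in LWA.
        context = avg_pool(context_window(seq, start, window, ratio), ratio)
        for q in queries:
            weights = softmax([q * k for k in context])
            out.append(sum(w * v for w, v in zip(weights, context)))
    return out


if __name__ == "__main__":
    signal = [float(i) for i in range(8)]
    print(varying_window_attention_1d(signal, window=4, ratio=2))
```

With ratio set to 1 the context is the query window itself and the function reduces to plain local window attention; larger ratios widen what each query can see while the number of keys per window stays fixed.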

arXiv information

Authors	Haotian Yan, Ming Wu, Chuang Zhang
Published	2024-04-25 12:35:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
