Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

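Summary

Domain-generalized urban-scene semantic segmentation (USSS) aims to learn semantic predictions that generalize across diverse urban-scene styles.
Unlike the general domain-gap problem, USSS is distinctive in that semantic categories are often similar across urban scenes, while styles can vary significantly with the urban landscape, weather conditions, lighting, and other factors.
Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes.
In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS.
The main idea is to strengthen the focus on content information of the mask attention mechanism, the fundamental component of Transformer segmentation models.
To achieve this, we introduce a novel content-enhanced mask attention mechanism.
It learns mask queries from both the image feature and its down-sampled counterpart, because lower-resolution image features usually contain more robust content information and are less sensitive to style variations.
These features are fused in a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme.
Extensive experiments on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods for domain-generalized semantic segmentation, achieving improvements of up to 14.00% in mIoU (mean intersection over union).
The source code for CMFormer will be made available in this repository: https://github.com/BiQiWHU/domain-generalized-urban-scene-segmentation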
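
The idea of attending to both an image feature and its down-sampled counterpart can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not specify the attention formula or the fusion operator, so the sketch assumes the masked-attention form softmax(QK^T / sqrt(C) + M)V common in mask-classification Transformers, omits all learned projections, and fuses the two resolutions by simple averaging. The function names `avg_pool_2x2`, `pool_mask_2x2`, `masked_attention`, and `content_enhanced_step` are hypothetical, and the feature height and width are assumed to be even.

```python
import math


def avg_pool_2x2(feat):
    """Down-sample an H x W grid of C-dim vectors by 2x2 average pooling."""
    channels = len(feat[0][0])
    return [
        [
            [
                sum(feat[2 * i + di][2 * j + dj][k] for di in (0, 1) for dj in (0, 1)) / 4.0
                for k in range(channels)
            ]
            for j in range(len(feat[0]) // 2)
        ]
        for i in range(len(feat) // 2)
    ]


def pool_mask_2x2(mask):
    """Down-sample a boolean H x W mask: a cell is kept if any of its 4 pixels is."""
    return [
        [
            any(mask[2 * i + di][2 * j + dj] for di in (0, 1) for dj in (0, 1))
            for j in range(len(mask[0]) // 2)
        ]
        for i in range(len(mask) // 2)
    ]


def masked_attention(queries, feat, masks):
    """Per query: softmax(q . k / sqrt(C) + M) applied to V, with K = V = pixels.

    M is 0 where the query's mask is True and -inf elsewhere.
    """
    pixels = [vec for row in feat for vec in row]
    channels = len(pixels[0])
    attended = []
    for query, mask in zip(queries, masks):
        allowed = [ok for row in mask for ok in row]
        if not any(allowed):
            # An empty mask would make the softmax undefined; attend everywhere.
            allowed = [True] * len(pixels)
        scores = [
            sum(q * p for q, p in zip(query, pixel)) / math.sqrt(channels) if ok else -math.inf
            for pixel, ok in zip(pixels, allowed)
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        attended.append(
            [sum(w * pixel[k] for w, pixel in zip(weights, pixels)) / total for k in range(channels)]
        )
    return attended


def content_enhanced_step(queries, feat, masks):
    """Update queries from the full-resolution and the down-sampled feature."""
    full = masked_attention(queries, feat, masks)
    low = masked_attention(queries, avg_pool_2x2(feat), [pool_mask_2x2(m) for m in masks])
    # ASSUMPTION: fuse by averaging, then add a residual connection.
    # The abstract only says the features are "fused"; the real operator may differ.
    return [
        [q + 0.5 * (f + l) for q, f, l in zip(query, f_vec, l_vec)]
        for query, f_vec, l_vec in zip(queries, full, low)
    ]
```

For a 2 x 2 single-channel feature `[[[1.0], [2.0]], [[3.0], [4.0]]]`, a zero query, and a mask that selects only the top-right pixel, the full-resolution branch attends to 2.0, the down-sampled branch sees the pooled value 2.5, and the fused result is 2.25. The low-resolution branch averages away pixel-level detail, which is one way to picture why it might be less sensitive to style variation.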

Abstract (original)

Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors. Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes. In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The main idea is to enhance the focus of the fundamental component, the mask attention mechanism, in Transformer segmentation models on content information. To achieve this, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, as lower-resolution image features usually contain more robust content information and are less sensitive to style variations. These features are fused into a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme. Extensive experiments conducted on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods for domain-generalized semantic segmentation, achieving improvements of up to 14.00% in terms of mIoU (mean intersection over union). The source code for CMFormer will be made available at this repository: https://github.com/BiQiWHU/domain-generalized-urban-scene-segmentation
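
The reported metric, mIoU, averages the per-class ratio of intersection to union between predicted and ground-truth labels. The following is a minimal pure-Python sketch of that definition on flat label lists. The function name `mean_iou`, the `ignore_index` value of 255, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration, not details taken from the paper. Benchmark toolkits typically accumulate these counts over a whole validation set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection over union for flat lists of integer class labels.

    Pixels whose ground-truth label equals ignore_index are skipped.
    Predictions are assumed to lie in range(num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # The pixel lies in both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # The pixel lies in the predicted region of p and the true region of t.
            union[p] += 1
            union[t] += 1
    # ASSUMPTION: classes absent from both prediction and ground truth are skipped.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
```

In this toy example the per-class IoU values are 1/2, 2/3, and 1, so `score` is their mean, roughly 0.72.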

arXiv information

Authors	Qi Bi, Shaodi You, Theo Gevers
Published	2023-08-29 15:25:30+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
