Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation

Summary

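Research on weakly supervised semantic segmentation (WSSS) has explored many directions for improving the typical pipeline of a CNN, class activation maps (CAM), and refinements, with the image-class label as the only supervision.
Although the gap with fully supervised methods has narrowed, closing it further seems unlikely within this framework.
Meanwhile, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM.
ViT features have been shown to retain scene layout and object boundaries in self-supervised learning.
To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability.
This work proposes a new WSSS method, not based on CAM, called ViT-PCM (ViT Patch-Class Mapping).
The proposed end-to-end network learns, in a single optimization process, refined shape and proper localization for segmentation masks.
Our model outperforms the state of the art on baseline pseudo-masks (BPM), achieving $69.3\%$ mIoU on the PascalVOC 2012 $val$ set.
Our approach uses the fewest parameters while obtaining higher accuracy than all other approaches.
In short, the quantitative and qualitative results of our method show that ViT-PCM is an excellent alternative to CNN-CAM-based architectures.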
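
The role of Global Max Pooling described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear patch classifier, the random features standing in for ViT patch embeddings, the 14x14 patch grid, and the binary cross-entropy loss are all assumptions chosen for illustration. It only shows how per-patch class probabilities can be pooled into image-level class probabilities that image-class labels can supervise.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_class_probs(patch_features, weight, bias):
    """Map (N, D) patch features to (N, C) per-patch class probabilities."""
    return softmax(patch_features @ weight + bias, axis=-1)

def global_max_pool(patch_probs):
    """Pool (N, C) patch probabilities to (C,): each class is scored by its most confident patch."""
    return patch_probs.max(axis=0)

def bce(image_probs, image_labels, eps=1e-7):
    """Binary cross-entropy between pooled probabilities and multi-hot image labels."""
    p = np.clip(image_probs, eps, 1 - eps)
    return float(-np.mean(image_labels * np.log(p) + (1 - image_labels) * np.log(1 - p)))

# Hypothetical sizes: a 14x14 grid of patches, 32-dim features, 4 classes.
rng = np.random.default_rng(0)
num_patches, dim, num_classes = 196, 32, 4
features = rng.normal(size=(num_patches, dim))      # stand-in for ViT patch embeddings
weight = rng.normal(size=(dim, num_classes)) * 0.1  # assumed linear patch classifier
bias = np.zeros(num_classes)

patch_probs = patch_class_probs(features, weight, bias)
image_probs = global_max_pool(patch_probs)
labels = np.array([1.0, 0.0, 1.0, 0.0])             # image-level labels only
loss = bce(image_probs, labels)

# The same patch probabilities give a coarse label map once reshaped to the grid.
patch_labels = patch_probs.argmax(axis=1).reshape(14, 14)
```

Because the image-level loss is computed on the pooled maximum, its gradient flows back to the patches that are most confident for each class, which is one way to read the "negotiation" between patch-level and class-level probabilities.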

Abstract (original)

Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline CNN plus class activation maps (CAM) plus refinements, given the image-class label as the only supervision. Though the gap with the fully supervised methods is reduced, further abating the spread seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain a scene layout, and object boundaries in self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM. The end-to-end presented network learns with a single optimization process, refined shape and proper localization for segmentation masks. Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU on PascalVOC 2012 $val$ set. We show that our approach has the least set of parameters, though obtaining higher accuracy than all other approaches. In a sentence, quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
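
For readers unfamiliar with the mIoU metric quoted above, the following is a small sketch of how mean intersection-over-union can be computed for a single pair of label maps. The function name and the toy arrays are hypothetical. Benchmark evaluations typically accumulate intersections and unions over the whole dataset before averaging, so this per-image version is a simplification.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x4 label maps with two classes (0 = background, 1 = object).
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

score = mean_iou(pred, target, num_classes=2)
# Class 0: intersection 4, union 5 -> 0.8; class 1: intersection 3, union 4 -> 0.75.
```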

arXiv information

Authors	Simone Rossetti, Damiano Zappia, Marta Sanzari, Marco Schaerf, Fiora Pirri
Published	2022-10-31 15:32:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
