Enhancing Weakly Supervised Semantic Segmentation with Multi-modal Foundation Models: An End-to-End Approach

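Summary

Semantic segmentation is a core problem in computer vision, but the high cost of data annotation has hindered its wide application.
Weakly Supervised Semantic Segmentation (WSSS) uses partial or incomplete labels, offering a cost-efficient alternative to the extensive labeling that fully supervised methods require.
Existing WSSS methods have difficulty learning object boundaries, which leads to poor segmentation results.
We propose a novel and effective framework that addresses these issues by applying visual foundation models inside bounding boxes.
Following a two-stage WSSS framework, the proposed network consists of a pseudo-label generation module and a segmentation module.
The first stage uses the Segment Anything Model (SAM) to generate high-quality pseudo-labels.
To ease the problem of delineating precise boundaries, SAM is applied inside bounding boxes obtained with the help of another pre-trained foundation model (e.g., Grounding-DINO).
In addition, using CLIP for classification eliminates the need for image-level label supervision.
In the second stage, the generated high-quality pseudo-labels are used to train an off-the-shelf segmenter that achieves state-of-the-art performance on PASCAL VOC 2012 and MS COCO 2014.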

Summary (original)

Semantic segmentation is a core computer vision problem, but the high costs of data annotation have hindered its wide application. Weakly-Supervised Semantic Segmentation (WSSS) offers a cost-efficient workaround to extensive labeling in comparison to fully-supervised methods by using partial or incomplete labels. Existing WSSS methods have difficulties in learning the boundaries of objects leading to poor segmentation results. We propose a novel and effective framework that addresses these issues by leveraging visual foundation models inside the bounding box. Adopting a two-stage WSSS framework, our proposed network consists of a pseudo-label generation module and a segmentation module. The first stage leverages Segment Anything Model (SAM) to generate high-quality pseudo-labels. To alleviate the problem of delineating precise boundaries, we adopt SAM inside the bounding box with the help of another pre-trained foundation model (e.g., Grounding-DINO). Furthermore, we eliminate the necessity of using the supervision of image labels, by employing CLIP in classification. Then in the second stage, the generated high-quality pseudo-labels are used to train an off-the-shelf segmenter that achieves the state-of-the-art performance on PASCAL VOC 2012 and MS COCO 2014.
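
The abstract does not spell out how per-box masks become a training label map, so the following is only a minimal sketch of one plausible way to do it, not the authors' implementation. It assumes that a detector such as Grounding-DINO has already produced pixel boxes, that a promptable segmenter such as SAM has produced one binary mask per box, and that a classifier such as CLIP has assigned a class index and confidence to each box. The function name `compose_pseudo_label`, its arguments, and the rule that higher-confidence regions overwrite lower-confidence ones are all hypothetical choices made for illustration.

```python
import numpy as np


def compose_pseudo_label(height, width, boxes, masks, class_ids, scores, background=0):
    """Merge per-box binary masks into one semantic pseudo-label map.

    boxes:     list of (x0, y0, x1, y1) pixel boxes (x1, y1 exclusive)
    masks:     list of boolean arrays of shape (height, width), one per box
    class_ids: list of integer class indices, one per box
    scores:    list of confidences used to resolve overlaps (assumed rule)
    """
    label = np.full((height, width), background, dtype=np.uint8)
    # Paint low-confidence regions first so that higher-confidence
    # regions overwrite them where they overlap.
    for i in np.argsort(scores):
        x0, y0, x1, y1 = boxes[i]
        inside = np.zeros((height, width), dtype=bool)
        inside[y0:y1, x0:x1] = True
        # Keep only the mask pixels that fall inside the box.
        label[masks[i] & inside] = class_ids[i]
    return label
```

The resulting array has the same shape as the image and could be saved as a label map for training a standard segmenter in the second stage. Pixels not covered by any box keep the background index.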

arXiv information

Authors	Elham Ravanbakhsh, Cheng Niu, Yongqing Liang, J. Ramanujam, Xin Li
Published	2024-05-10 16:42:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
