Towards Label-free Scene Understanding by Vision Foundation Models

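Summary

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated strong zero-shot performance on image classification and segmentation tasks.
However, combining CLIP and SAM for label-free scene understanding has not yet been explored.
This paper investigates the potential of vision foundation models to let networks understand the 2D and 3D worlds without labelled data.
The main challenge is supervising networks effectively under extremely noisy pseudo labels. The pseudo labels are generated by CLIP and degrade further when they are propagated from the 2D domain to the 3D domain.
To address these challenges, the authors propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously.
Specifically, they introduce a prediction consistency regularization to co-train the 2D and 3D networks, and further impose latent space consistency on the networks using SAM's robust feature representation.
Experiments on diverse indoor and outdoor datasets demonstrate the superior performance of the method in understanding 2D and 3D open environments.
The 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and 7.9%, respectively.
On the nuScenes dataset, performance is 26.8%, an improvement of 6%.
Code will be released (https://github.com/runnanchen/Label-Free-Scene-Understanding).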
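
The summary names two consistency terms but does not give their form. The following is a minimal sketch of what such terms could look like, not the authors' implementation: it assumes a symmetric KL divergence between paired 2D and 3D class predictions, and a cosine distance between network features and fixed reference features (standing in for SAM features). All function names and both loss choices are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def prediction_consistency(logits_2d, logits_3d):
    """Assumed form: mean symmetric KL over paired pixel/point predictions."""
    total = 0.0
    for l2, l3 in zip(logits_2d, logits_3d):
        p, q = softmax(l2), softmax(l3)
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(logits_2d)


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def latent_consistency(features, reference_features):
    """Assumed form: mean (1 - cosine) against fixed reference features."""
    distances = [
        1.0 - cosine_similarity(f, r)
        for f, r in zip(features, reference_features)
    ]
    return sum(distances) / len(distances)
```

In this sketch, each pixel prediction is assumed to be already paired with the 3D point that projects onto it. Both terms are zero when the two inputs agree and grow as they diverge, which is the property a consistency regularizer needs.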

Summary (original)

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, the incorporation of CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and further exacerbated during the propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train 2D and 3D networks, then further impose the networks’ latent space consistency using the SAM’s robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D network achieves label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving 4.7% and 7.9%, respectively. And for nuScenes dataset, our performance is 26.8% with an improvement of 6%. Code will be released (https://github.com/runnanchen/Label-Free-Scene-Understanding).

arXiv information

Authors	Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, Wenping Wang
Published	2023-06-06 17:57:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
