Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

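Summary

Foundation models have made major progress on 2D and language tasks such as image segmentation, object detection, and visual-language understanding. Their potential to strengthen 3D scene representation learning, however, remains largely unexplored because of the domain gap. This paper proposes Bridge3D, a methodology that closes this gap by pre-training 3D models with features, semantic masks, and captions obtained from foundation models. In particular, the approach uses semantic masks from these models to guide the masking and reconstruction process of a masked autoencoder. This strategy lets the network focus more on foreground objects, which strengthens 3D representation learning. Image captioning foundation models are also used to bridge the 3D-text gap at the scene level. To further promote knowledge distillation from well-learned 2D and text representations into the 3D model, the paper introduces a new method that uses foundation models to generate highly accurate object-level masks and object-level semantic text information. The method outperforms the state of the art on 3D object detection and semantic segmentation. On the ScanNet dataset, for example, it surpasses the previous state-of-the-art method, PiMAE, by a margin of 5.3%.

Abstract (original)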

Foundation models have made significant strides in 2D and language tasks such as image segmentation, object detection, and visual-language understanding. Nevertheless, their potential to enhance 3D scene representation learning remains largely untapped due to the domain gap. In this paper, we propose an innovative methodology Bridge3D to address this gap, pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our approach utilizes semantic masks from these models to guide the masking and reconstruction process in the masked autoencoder. This strategy enables the network to concentrate more on foreground objects, thereby enhancing 3D representation learning. Additionally, we bridge the 3D-text gap at the scene level by harnessing image captioning foundation models. To further facilitate knowledge distillation from well-learned 2D and text representations to the 3D model, we introduce a novel method that employs foundation models to generate highly accurate object-level masks and semantic text information at the object level. Our approach notably outshines state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, our method surpasses the previous state-of-the-art method, PiMAE, by a significant margin of 5.3%.
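The abstract says that semantic masks guide the masking step of the masked autoencoder so the network concentrates on foreground objects, but it does not give the sampling rule. The following is a minimal sketch of one way such guidance could work, not the authors' implementation: it assumes each point-cloud token already carries a foreground/background flag derived from a 2D semantic mask, and it masks foreground tokens with higher probability using weighted sampling without replacement (the Efraimidis-Spirakis key trick). The function name `foreground_guided_mask` and the parameters `mask_ratio` and `fg_weight` are hypothetical and chosen only for illustration.

```python
import random


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=3.0, seed=None):
    """Return a list of booleans marking which tokens to mask.

    Assumption (not from the paper): foreground tokens are sampled with
    relative weight `fg_weight`, background tokens with weight 1.0.
    """
    rng = random.Random(seed)
    n = len(is_foreground)
    n_mask = round(mask_ratio * n)

    # Efraimidis-Spirakis: draw u ~ U(0, 1) per token, use key u ** (1 / w),
    # and keep the tokens with the largest keys. Larger weights push keys
    # toward 1, so foreground tokens are selected more often.
    keys = []
    for index, fg in enumerate(is_foreground):
        weight = fg_weight if fg else 1.0
        keys.append((rng.random() ** (1.0 / weight), index))

    keys.sort(reverse=True)
    masked = {index for _, index in keys[:n_mask]}
    return [index in masked for index in range(n)]


# Example: 5 foreground tokens followed by 5 background tokens.
flags = [True] * 5 + [False] * 5
mask = foreground_guided_mask(flags, mask_ratio=0.6, seed=0)
```

With `fg_weight=1.0` this reduces to the uniform random masking of a standard masked autoencoder, which makes the weight a convenient knob for comparing the two regimes.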

arXiv information

Authors	Zhimin Chen, Bing Li
Published	2023-05-15 16:36:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
