Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

Summary

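Foundation models have achieved remarkable results on 2D and language tasks such as image segmentation, object detection, and visual-language understanding.
However, their potential to enrich 3D scene representation learning remains largely untapped because of the domain gap.
This work proposes Bridge3D, a method that addresses this gap by pre-training 3D models with features, semantic masks, and captions obtained from foundation models.
Specifically, the method uses semantic masks from foundation models to guide the masking and reconstruction process of a masked autoencoder, so that the model focuses more on foreground representations.
It also bridges the 3D-text gap at the scene level using image captioning foundation models, which facilitates scene-level knowledge distillation.
The authors extend this bridging further with an object-level knowledge distillation method that uses highly accurate object-level masks and semantic text data from foundation models.
The method significantly outperforms existing state-of-the-art methods on 3D object detection and semantic segmentation tasks.
For example, on the ScanNet dataset, Bridge3D improves the baseline by 6.3%.
Code is available at https://github.com/Zhimin-C/Bridge3D.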

Abstract (original)

Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding. However, their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap. In this work, we propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, enabling more focused attention on foreground representations. Moreover, we bridge the 3D-text gap at the scene level using image captioning foundation models, thereby facilitating scene-level knowledge distillation. We further extend this bridging effort by introducing an innovative object-level knowledge distillation method that harnesses highly accurate object-level masks and semantic text data from foundation models. Our methodology significantly surpasses the performance of existing state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will be available at: https://github.com/Zhimin-C/Bridge3D
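
The abstract does not spell out how semantic masks guide the masking step, so the following is only a minimal sketch of one plausible reading: patches flagged as foreground are sampled for masking more often than background patches. The function name `foreground_guided_mask`, the `fg_weight` parameter, and the weighted-sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import random


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=3.0, seed=0):
    """Return sorted indices of patches to mask, favouring foreground patches.

    is_foreground: list of bools, one per point patch. Assumed here to come
                   from projecting 2D semantic masks onto the point cloud.
    mask_ratio:    fraction of patches to mask.
    fg_weight:     hypothetical factor by which foreground patches are
                   preferred over background patches (background weight is 1).
    """
    rng = random.Random(seed)
    n_mask = round(len(is_foreground) * mask_ratio)

    # Weighted sampling without replacement: give each patch the key
    # u ** (1 / w) with u uniform in [0, 1), then keep the largest keys.
    keys = []
    for idx, fg in enumerate(is_foreground):
        weight = fg_weight if fg else 1.0
        keys.append((rng.random() ** (1.0 / weight), idx))
    keys.sort(reverse=True)

    return sorted(idx for _, idx in keys[:n_mask])


if __name__ == "__main__":
    flags = [True, True, False, False, True, False, False, False, True, False]
    print(foreground_guided_mask(flags))
```

A masked autoencoder would then hide the returned patches and reconstruct them from the visible ones. Setting `fg_weight=1.0` reduces the sketch to ordinary uniform random masking.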

arXiv information

Authors	Zhimin Chen, Longlong Jing, Yingwei Li, Bing Li
Published	2023-11-02 15:21:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
