Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment

要約

ディープニューラルネットワークモデルは、クローズドセットの設定でトレーニングされ、フルラベルを使用して、3Dシーンの理解において顕著な進歩を遂げました。
ただし、主要なボトルネックは、これらのモデルには、多様な現実世界のアプリケーションでトレーニングカテゴリを超えて目に見えない新しいクラスを認識する能力がないことです。
したがって、特にラベルがかなり不足している状況では、3Dポイントクラウドセグメンテーションと検出の両方に同時に適用できるフレームワークが緊急に必要です。
この作品は、ラベル付きのシーンが非常に限られているときに3Dシーンの理解を扱うための一般化された簡単なフレームワークを提示します。
事前に訓練されたビジョン言語モデルから新しいカテゴリの知識を抽出するために、階層的な特徴を調整した事前トレーニングおよび知識蒸留戦略を提案して、有意義な情報を大規模なビジョン言語モデルから抽出および蒸留します。
– タスクを理解する語彙シーン。
潜在的なインスタンスの識別を促進し、効率を保証するために、ポイントクラウドの監視されていない地域レベルのセマンティックコントラスト学習スキームを提案します。
限られた再構成の場合、WS3D ++と呼ばれる提案されたアプローチは、セマンティックセグメンテーションとインスタンスセグメンテーションのタスクの両方で、大規模なスキャネットベンチマークで1位にランクされています。
屋内と屋外の両方のシーンを使用した広範な実験により、データ効率の良い学習とオープンワールドの少数の学習の両方において、アプローチの有効性が実証されました。
このコードは、https：//drive.google.com/drive/folders/1m58v-ptr8dbewd296zjkng_m2qq-mtap？usp = sharingで公開されています。

要約(オリジナル)

Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck is that these models do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse real-world applications. Therefore, we are in urgent need of a framework that can simultaneously be applicable to both 3D point cloud segmentation and detection, particularly in the circumstances where the labels are rather scarce. This work presents a generalized and straightforward framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark on both the task of semantic segmentation and instance segmentation. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. The code is made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.

arxiv情報

著者	Kangcheng Liu,Yong-Jin Liu,Baoquan Chen
発行日	2025-02-19 09:52:00+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー