Language-driven Open-Vocabulary 3D Scene Understanding

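Abstract

Open-vocabulary scene understanding aims to localize and recognize unseen categories that lie outside the annotated label space.
Recent breakthroughs in 2D open-vocabulary perception have been driven largely by Internet-scale paired image-text data with rich vocabulary concepts.
However, this success cannot be transferred directly to 3D scenarios, because large-scale 3D-text pairs are not available.
To address this, we propose to distill the knowledge encoded in pre-trained vision-language (VL) foundation models by captioning multi-view images of 3D scenes, which explicitly associates 3D data with semantically rich captions.
Further, to facilitate coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs that exploit the geometric constraints between 3D scenes and multi-view images.
Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks.
Our method not only outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ on open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks.
Code is available at https://github.com/CVMI-Lab/PLA.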

Abstract (original)

Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to facilitate coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ on open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. Code will be available at https://github.com/CVMI-Lab/PLA.
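
The contrastive step mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name contrastive_caption_loss, the temperature value, and the toy vectors are all hypothetical. It assumes that each 3D region has already been pooled into one feature vector and that each paired caption has already been encoded into a text embedding of the same dimension.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_caption_loss(region_feats, caption_feats, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    region_feats[i] and caption_feats[i] form a positive pair; every other
    caption in the batch acts as a negative for region i.
    """
    regions = [l2_normalize(v) for v in region_feats]
    captions = [l2_normalize(v) for v in caption_feats]
    total = 0.0
    for i, region in enumerate(regions):
        # Cosine similarity to every caption, sharpened by the temperature.
        logits = [
            sum(a * b for a, b in zip(region, caption)) / temperature
            for caption in captions
        ]
        # Numerically stable log-sum-exp over all captions.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the paired caption as the target.
        total += log_denominator - logits[i]
    return total / len(regions)


# Toy check: correctly paired features give a much lower loss than swapped ones.
aligned = contrastive_caption_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_caption_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing a loss of this shape pulls each 3D feature toward its own caption embedding and away from the other captions in the batch, which is the general mechanism by which 3D and text embeddings can come to share a space.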

arXiv information

Authors	Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi
Published	2022-11-29 15:52:22+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
