POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Abstract

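We describe an approach for predicting an open-vocabulary 3D semantic voxel occupancy map from input 2D images, with the goal of enabling 3D grounding, segmentation, and retrieval of free-form language queries.
This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, for which annotated 3D training data is difficult to obtain.
This work makes three contributions.
First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction.
The architecture consists of a 2D-3D encoder together with an occupancy prediction head and a 3D-language head.
The output is a dense voxel map of 3D-grounded language embeddings that enables a range of open-vocabulary tasks.
Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities, (i) images, (ii) language, and (iii) LiDAR point clouds, and enables training the proposed architecture with a strong pre-trained vision-language model, without requiring any manual 3D language annotations.
Finally, we quantitatively demonstrate the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets, and 3D grounding and retrieval of free-form language queries using a small dataset that we propose as an extension of nuScenes.
The project page is at https://vobecant.github.io/POP3D.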
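
To make the idea of a voxel map of language embeddings more concrete, here is a minimal sketch of how such a map could be queried. The abstract does not say how queries are matched against voxels, so cosine similarity against text embeddings (the usual practice with CLIP-style vision-language models) is an assumption here. Every name below (`segment_voxels`, `retrieve`, `occ_threshold`) is hypothetical, and the 3-dimensional vectors are toy stand-ins for real text-encoder and model outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_voxels(voxel_embeddings, occupancy, text_embeddings, occ_threshold=0.5):
    """Zero-shot segmentation: label each occupied voxel with the closest prompt.

    voxel_embeddings: {(x, y, z): embedding}  (assumed output of a 3D-language head)
    occupancy:        {(x, y, z): probability} (assumed output of an occupancy head)
    text_embeddings:  {class_name: embedding}  (assumed output of a text encoder)
    """
    labels = {}
    for voxel, emb in voxel_embeddings.items():
        if occupancy.get(voxel, 0.0) < occ_threshold:
            continue  # voxels predicted as free space receive no label
        labels[voxel] = max(
            text_embeddings, key=lambda name: cosine(emb, text_embeddings[name])
        )
    return labels


def retrieve(voxel_embeddings, occupancy, query_embedding, top_k=1, occ_threshold=0.5):
    """Grounding/retrieval: return the top_k occupied voxels closest to a query."""
    scored = [
        (cosine(emb, query_embedding), voxel)
        for voxel, emb in voxel_embeddings.items()
        if occupancy.get(voxel, 0.0) >= occ_threshold
    ]
    scored.sort(reverse=True)
    return [voxel for _, voxel in scored[:top_k]]


# Toy scene with three voxels; the last one is predicted to be empty.
voxels = {
    (0, 0, 0): [0.9, 0.1, 0.0],
    (1, 0, 0): [0.1, 0.8, 0.1],
    (2, 0, 0): [0.0, 0.2, 0.9],
}
occupancy = {(0, 0, 0): 0.95, (1, 0, 0): 0.80, (2, 0, 0): 0.10}
prompts = {"car": [1.0, 0.0, 0.0], "pedestrian": [0.0, 1.0, 0.0]}

labels = segment_voxels(voxels, occupancy, prompts)
nearest = retrieve(voxels, occupancy, prompts["pedestrian"], top_k=1)
```

In this toy scene, `labels` assigns "car" to the first voxel and "pedestrian" to the second, and skips the third because its occupancy falls below the threshold. Since the class list is just a dictionary of prompt embeddings, changing the vocabulary needs no retraining in this sketch, which is the property that "open-vocabulary" refers to.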

Abstract (original)

We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: Zero-shot 3D semantic segmentation using existing datasets; 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes. You can find the project page here https://vobecant.github.io/POP3D.

arXiv information

Authors	Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
Published	2024-01-17 18:51:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
