VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

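Summary

Humans can easily imagine the complete 3D geometry of occluded objects and scenes.
This appealing ability is essential for recognition and understanding.
To bring this capability to AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that outputs complete 3D volumetric semantics from 2D images alone.
The framework adopts a two-stage design: it starts from a sparse set of visible, occupied voxel queries obtained from depth estimation, then applies a densification stage that generates dense 3D voxels from the sparse ones.
The key idea behind this design is that the visual features of a 2D image correspond only to visible scene structures, not to occluded or empty space.
Starting with the featurization and prediction of the visible structures is therefore more reliable.
Once the set of sparse queries is obtained, a masked autoencoder design propagates their information to all voxels via self-attention.
Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art, with relative improvements of 20.0% in geometry and 18.1% in semantics, while reducing GPU memory during training to less than 16GB.
The code is available at https://github.com/NVlabs/VoxFormer.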
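
The first stage, which picks a sparse set of occupied voxel queries from estimated depth, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple pinhole camera, and the function name `depth_to_voxel_queries`, the intrinsics (`fx`, `fy`, `cx`, `cy`), the voxel size, and the grid layout are all hypothetical values chosen for illustration.

```python
import math


def depth_to_voxel_queries(depth, fx, fy, cx, cy, voxel_size, grid_shape, grid_origin):
    """Back-project a depth map and return the set of occupied voxel indices.

    depth[v][u] is a depth in meters; values <= 0 are treated as invalid.
    Every parameter here is an illustrative assumption, not a value from the paper.
    """
    queries = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no depth estimate for this pixel
            # Pinhole back-projection into camera coordinates.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Quantize the 3D point to an integer voxel index.
            idx = tuple(
                math.floor((c - o) / voxel_size)
                for c, o in zip((x, y, z), grid_origin)
            )
            # Keep only points that fall inside the voxel grid.
            if all(0 <= i < n for i, n in zip(idx, grid_shape)):
                queries.add(idx)
    return queries


# Toy 2x2 depth map with one invalid pixel (hypothetical numbers).
depth = [[1.0, 1.0],
         [0.0, 2.5]]
queries = depth_to_voxel_queries(
    depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5,
    voxel_size=1.0, grid_shape=(4, 4, 4), grid_origin=(-2.0, -2.0, 0.0),
)
print(sorted(queries))  # 3 of the 64 voxels become queries
```

Only the voxels hit by a back-projected pixel become queries, so the query set stays far smaller than the full grid. That sparsity is what the two-stage design relies on.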

Abstract (original)

Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training to less than 16GB. Our code is available on https://github.com/NVlabs/VoxFormer.
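
The densification stage can be sketched in the same spirit. The example below is a simplified illustration under stated assumptions rather than the model's actual architecture: it uses single-head self-attention with identity projections and no positional embeddings, and the names `densify`, `self_attention`, and `mask_token` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        # Weighted sum of all token features.
        out.append([sum(wi * t[j] for wi, t in zip(w, tokens)) for j in range(d)])
    return out


def densify(num_voxels, sparse_features, mask_token):
    """Fill voxels without a query feature with a shared mask token, then mix them."""
    tokens = [list(sparse_features.get(i, mask_token)) for i in range(num_voxels)]
    return self_attention(tokens)


# Four voxels, two of which carry features from the sparse queries (toy values).
sparse = {0: [1.0, 0.0], 3: [0.0, 1.0]}
dense = densify(4, sparse, mask_token=[0.0, 0.0])
print(dense[1])  # a masked voxel now holds a mixture of the query features
```

Because this sketch omits positional embeddings, every masked voxel receives the same output. A real model would need some position-dependent signal to tell the masked voxels apart.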

arXiv information

Authors	Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M. Alvarez, Sanja Fidler, Chen Feng, Anima Anandkumar
Published	2023-03-25 07:48:55+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
