Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Abstract

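Accurately predicting 3D semantic occupancy from 2D images is essential for autonomous agents to understand their surroundings for planning and navigation.
State-of-the-art methods are typically fully supervised, requiring large labeled datasets acquired with expensive LiDAR sensors and painstaking voxel-wise labeling by human annotators.
The resource-intensive nature of this annotation process significantly limits the applicability and scalability of these methods.
To reduce the dependency on densely annotated data, we introduce a novel semi-supervised framework.
Our approach leverages 2D foundation models to generate essential geometric and semantic cues about the 3D scene, facilitating a more efficient training process.
Our framework exhibits two notable properties: (1) Generalizability: it is applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods.
(2) Effectiveness: in experiments on SemanticKITTI and NYUv2, our method achieves up to 85% of fully supervised performance using only 10% of the labeled data.
This approach not only reduces the cost and labor of data annotation but also demonstrates the potential for broader adoption of camera-based systems for 3D semantic occupancy prediction.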

Abstract (original)

Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.

arXiv information

Authors	Duc-Hai Pham, Duc-Dung Nguyen, Anh Pham, Tuan Ho, Phong Nguyen, Khoi Nguyen, Rang Nguyen
Published	2025-01-09 12:45:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
