Multi-modal NeRF Self-Supervision for LiDAR Semantic Segmentation

Summary

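LiDAR semantic segmentation, a fundamental task in autonomous driving perception, assigns a semantic label to each LiDAR point.
Fully supervised models have been widely applied to this task, but they require labels for every scan, which either restricts them to a narrow domain or demands impractical amounts of expensive annotation.
Camera images, which are generally recorded alongside LiDAR point clouds, can be processed by widely available 2D foundation models that are generic and dataset-agnostic.
However, distilling knowledge from 2D data to improve LiDAR perception raises domain adaptation challenges.
For example, classical perspective projection suffers from the parallax effect caused by the position shift between the two sensors at their respective capture times.
We propose a semi-supervised learning setup that leverages unlabeled LiDAR point clouds together with knowledge distilled from the camera images.
To self-supervise the model on unlabeled scans, we add an auxiliary NeRF head and cast rays from the camera viewpoint over the unlabeled voxel features.
The NeRF head predicts a density and semantic logits at each sampled ray location, which are used to render pixel semantics.
Concurrently, we query the Segment-Anything (SAM) foundation model with the camera image to generate a set of unlabeled generic masks.
We fuse the masks with the pixel semantics rendered from LiDAR to produce pseudo-labels that supervise the pixel predictions.
During inference, we drop the NeRF head and run the model with LiDAR only.
We demonstrate the effectiveness of the approach on three public LiDAR semantic segmentation benchmarks: nuScenes, SemanticKITTI, and ScribbleKITTI.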
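
The rendering step performed by the NeRF head can be illustrated with the standard NeRF compositing rule applied to semantic logits. The following is a minimal sketch, not the authors' implementation: it assumes per-sample densities and logits are already available along one ray, and that logits are alpha-composited in the usual way. The function name, argument names, and toy values are hypothetical.

```python
import math


def render_pixel_semantics(densities, logits, deltas):
    """Composite per-sample semantic logits along one camera ray.

    densities: S non-negative densities, ordered near to far
    logits:    S lists of C semantic logits, one list per sample
    deltas:    S spacings between consecutive samples
    Returns (rendered_logits, weights).
    """
    num_classes = len(logits[0])
    rendered = [0.0] * num_classes
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, sample_logits, delta in zip(densities, logits, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(num_classes):
            rendered[c] += weight * sample_logits[c]
        transmittance *= 1.0 - alpha
    return rendered, weights


# Hypothetical ray: empty space, then a dense surface, then an occluded sample.
rendered, weights = render_pixel_semantics(
    densities=[0.0, 50.0, 50.0],
    logits=[[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]],
    deltas=[1.0, 1.0, 1.0],
)
predicted_class = max(range(len(rendered)), key=rendered.__getitem__)
```

In this toy ray the first sample has zero density and the third is hidden behind the second, so the rendered pixel takes its class from the first dense surface the ray meets. A real pipeline would batch this over many rays with a tensor library.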

Abstract (original)

LiDAR Semantic Segmentation is a fundamental task in autonomous driving perception consisting of associating each LiDAR point to a semantic label. Fully-supervised models have widely tackled this task, but they require labels for each scan, which either limits their domain or requires impractical amounts of expensive annotations. Camera images, which are generally recorded alongside LiDAR pointclouds, can be processed by the widely available 2D foundation models, which are generic and dataset-agnostic. However, distilling knowledge from 2D data to improve LiDAR perception raises domain adaptation challenges. For example, the classical perspective projection suffers from the parallax effect produced by the position shift between both sensors at their respective capture times. We propose a Semi-Supervised Learning setup to leverage unlabeled LiDAR pointclouds alongside distilled knowledge from the camera images. To self-supervise our model on the unlabeled scans, we add an auxiliary NeRF head and cast rays from the camera viewpoint over the unlabeled voxel features. The NeRF head predicts densities and semantic logits at each sampled ray location which are used for rendering pixel semantics. Concurrently, we query the Segment-Anything (SAM) foundation model with the camera image to generate a set of unlabeled generic masks. We fuse the masks with the rendered pixel semantics from LiDAR to produce pseudo-labels that supervise the pixel predictions. During inference, we drop the NeRF head and run our model with only LiDAR. We show the effectiveness of our approach in three public LiDAR Semantic Segmentation benchmarks: nuScenes, SemanticKITTI and ScribbleKITTI.
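
The abstract does not state how the SAM masks and the rendered pixel semantics are fused. One plausible rule, shown here purely as an assumption, is a majority vote: every pixel inside a class-agnostic mask receives the class that the LiDAR rendering predicts most often within that mask, and masks without a clear winner are left unlabeled. The names `fuse_masks_with_semantics`, `IGNORE`, and `min_agreement` are hypothetical.

```python
from collections import Counter

IGNORE = -1  # hypothetical index meaning "no pseudo-label for this pixel"


def fuse_masks_with_semantics(masks, rendered_classes, min_agreement=0.5):
    """Give each mask the majority class rendered inside it.

    masks:            list of masks, each a list of flat pixel indices
    rendered_classes: per-pixel argmax class rendered from LiDAR features
    min_agreement:    minimum fraction of a mask's pixels sharing the top class
    Returns per-pixel pseudo-labels; uncovered or ambiguous pixels get IGNORE.
    If masks overlap, later masks overwrite earlier ones.
    """
    pseudo = [IGNORE] * len(rendered_classes)
    for mask in masks:
        if not mask:
            continue
        votes = Counter(rendered_classes[p] for p in mask)
        winner, count = votes.most_common(1)[0]
        if count / len(mask) >= min_agreement:
            for p in mask:
                pseudo[p] = winner
    return pseudo


# Toy flattened image of 8 pixels with two hypothetical masks.
rendered_classes = [2, 2, 7, 2, 4, 5, 6, 3]
masks = [[0, 1, 2, 3], [4, 5, 6]]
pseudo_labels = fuse_masks_with_semantics(masks, rendered_classes)
```

Here the first mask is relabeled to class 2, which corrects the outlier at pixel 2, while the second mask has no majority and stays ignored along with the uncovered pixel 7. The resulting pseudo-labels could then serve as targets for a per-pixel loss on the rendered predictions.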

arXiv information

Authors	Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring, Fisher Yu
Published	2024-11-05 10:13:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
