Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder

Abstract

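This work proposes NeRF-Supervised Masked AutoEncoder (NS-MAE), a unified self-supervised pre-training framework for transferable multi-modal perception representation learning via masked multi-modal reconstruction in a Neural Radiance Field (NeRF).
Specifically, conditioned on given view directions and locations, multi-modal embeddings extracted from corrupted multi-modal input signals, i.e., LiDAR point clouds and images, are rendered into projected multi-modal feature maps via neural rendering.
The original multi-modal signals then serve as reconstruction targets for the rendered multi-modal feature maps, enabling self-supervised representation learning.
Extensive experiments show that representations learned via NS-MAE transfer promisingly to diverse multi-modal and single-modal (camera-only and LiDAR-only) perception models on diverse 3D perception downstream tasks (3D object detection and BEV map segmentation), with varying amounts of labeled fine-tuning data.
Moreover, we empirically find that NS-MAE benefits from the synergy between the masked autoencoder mechanism and the neural radiance field.
We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.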

Abstract (original)

This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning via masked multi-modal reconstruction in Neural Radiance Field (NeRF), namely NeRF-Supervised Masked AutoEncoder (NS-MAE). Specifically, conditioned on certain view directions and locations, multi-modal embeddings extracted from corrupted multi-modal input signals, i.e., Lidar point clouds and images, are rendered into projected multi-modal feature maps via neural rendering. Then, original multi-modal signals serve as reconstruction targets for the rendered multi-modal feature maps to enable self-supervised representation learning. Extensive experiments show that the representation learned via NS-MAE shows promising transferability for diverse multi-modal and single-modal (camera-only and Lidar-only) perception models on diverse 3D perception downstream tasks (3D object detection and BEV map segmentation) with diverse amounts of fine-tuning labeled data. Moreover, we empirically find that NS-MAE enjoys the synergy of both the mechanism of masked autoencoder and neural radiance field. We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.
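The pipeline described in the abstract (corrupt the inputs, render embeddings along camera rays, reconstruct the original signals) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the masking scheme, the rendering formula, or the loss. The sketch assumes random row-wise masking, the standard NeRF volume-rendering quadrature along a single ray, and a simple squared-error plus absolute-depth-error loss. The function names `random_mask`, `volume_render`, and `reconstruction_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np


def random_mask(x, mask_ratio, rng):
    """Zero out a random subset of rows of x (assumed masking scheme).

    Returns the corrupted copy and a boolean vector marking masked rows.
    """
    n = x.shape[0]
    n_masked = int(round(mask_ratio * n))
    masked = np.zeros(n, dtype=bool)
    masked[rng.choice(n, size=n_masked, replace=False)] = True
    corrupted = x.copy()
    corrupted[masked] = 0.0
    return corrupted, masked


def volume_render(sigma, feats, t):
    """Standard NeRF-style quadrature along one ray (assumed renderer).

    sigma: (S,) non-negative densities at the ray samples
    feats: (S, C) per-sample features (e.g. color or embeddings)
    t:     (S,) sample depths, sorted in ascending order
    """
    # Distance to the next sample; the last interval is treated as very long.
    delta = np.append(t[1:] - t[:-1], 1e10)
    # Opacity of each interval.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    rendered_feat = weights @ feats   # projected feature for this ray
    rendered_depth = weights @ t      # expected depth for this ray
    return rendered_feat, rendered_depth, weights


def reconstruction_loss(rendered_feat, rendered_depth, target_feat, target_depth):
    """Assumed loss: squared error on the image-side target plus absolute
    error on the LiDAR-derived depth target."""
    feat_term = np.mean((rendered_feat - target_feat) ** 2)
    depth_term = np.abs(rendered_depth - target_depth)
    return feat_term + depth_term
```

In a pre-training loop following this sketch, the densities and per-sample features would come from an encoder applied to the masked inputs, while the targets would be the unmasked image pixels and LiDAR depths, so the encoder is rewarded for inferring the missing content.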

arXiv information

Author	Xiaohao Xu
Published	2023-12-06 05:20:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
