MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning

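Summary

Learning robust, scalable visual representations from large volumes of multi-view video data remains a challenge in computer vision and autonomous driving.
Existing pre-training methods either depend on expensive supervised learning with 3D annotations, which limits scalability, or focus on single-frame or monocular inputs and ignore temporal information.
We propose MIM4D, a new pre-training paradigm based on dual masked image modeling (MIM).
MIM4D exploits both spatial and temporal relations by training on masked multi-view video inputs.
It builds pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision.
To compensate for the lack of dense 3D supervision, MIM4D reconstructs pixels with 3D volumetric differentiable rendering, which lets it learn geometric representations.
We show that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving.
It significantly improves existing methods on multiple downstream tasks, including BEV segmentation (8.7% IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP).
Our work offers a new option for learning representations at scale in autonomous driving.
Code and models are released at https://github.com/hustvl/MIM4D.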
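
The summary says pixels are reconstructed through 3D volumetric differentiable rendering, but does not spell out the rendering equation. As a rough illustration only, the sketch below uses the standard NeRF-style alpha compositing along a single camera ray. This is an assumption about the general technique, not the authors' implementation. The function name `composite_ray` and its parameters are hypothetical and do not come from the MIM4D code base.

```python
import math
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def composite_ray(
    densities: Sequence[float],
    colors: Sequence[Color],
    deltas: Sequence[float],
) -> Tuple[Color, List[float]]:
    """Composite samples along one ray (generic volume rendering sketch).

    densities: non-negative volume density at each sample (hypothetical input)
    colors:    RGB color predicted at each sample
    deltas:    distance between consecutive samples along the ray
    Returns the rendered pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    pixel = [0.0, 0.0, 0.0]
    weights: List[float] = []
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity contributed by this sample's segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            pixel[k] += weight * color[k]
        # Light remaining after passing through this sample.
        transmittance *= 1.0 - alpha
    return (pixel[0], pixel[1], pixel[2]), weights


if __name__ == "__main__":
    rendered, w = composite_ray(
        densities=[0.0, 2.0, 5.0],
        colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
        deltas=[0.5, 0.5, 0.5],
    )
    print(rendered, w)
```

Because every step is a smooth function of the densities and colors, a photometric loss between the rendered pixel and the observed image pixel can in principle be back-propagated to the 3D features, which is how 2D images can supervise 3D geometry when dense 3D labels are unavailable.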

Abstract (original)

Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting the scalability, or focus on single-frame or monocular inputs, neglecting the temporal information. We propose MIM4D, a novel pre-training paradigm based on dual masked image modeling (MIM). MIM4D leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision. To address the lack of dense 3D supervision, MIM4D reconstructs pixels by employing 3D volumetric differentiable rendering to learn geometric representations. We demonstrate that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves existing methods on multiple downstream tasks, including BEV segmentation (8.7% IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our work offers a new choice for learning representation at scale in autonomous driving. Code and models are released at https://github.com/hustvl/MIM4D

arXiv information

Authors	Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Published	2024-03-13 17:58:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
