Multi-Modal Masked Pre-Training for Monocular Panoramic Depth Completion

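Abstract

This paper formulates panoramic depth completion (PDC), a potentially valuable task motivated by the fact that panoramic 3D cameras often produce 360° depth maps with missing data in complex scenes. The goal is to recover dense panoramic depth from raw sparse depth and a panoramic RGB image. To address PDC, we train a deep network that takes both depth and image as input and predicts dense panoramic depth. Because the objective function is non-convex, however, optimizing the network parameters is difficult. To ease this problem, we propose a simple yet effective approach called M$^{3}$PT: multi-modal masked pre-training. Specifically, during pre-training we cover patches of the panoramic RGB image and the sparse depth simultaneously with a shared random mask, then reconstruct the sparse depth in the masked regions. To the best of our knowledge, this is the first demonstration that masked pre-training is effective for a multi-modal vision task, rather than the single-modal tasks addressed by masked autoencoders (MAE). Unlike MAE, where fine-tuning completely discards the pre-training decoder, M$^{3}$PT has no architectural difference between the pre-training and fine-tuning stages; the two differ only in prediction density, which potentially makes transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M$^{3}$PT on three panoramic datasets. Notably, we improve on state-of-the-art baselines by an average of 26.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog across the three benchmark datasets.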
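The shared-mask step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the patch size, the mask ratio, the use of zero as the fill value, and the convention that zero depth means "no measurement" are all assumptions made for the example.

```python
import numpy as np


def shared_patch_mask(height, width, patch, mask_ratio, rng):
    """Return a boolean (height, width) array; True marks masked pixels.

    Hypothetical helper: patch size and mask ratio are illustrative choices.
    """
    assert height % patch == 0 and width % patch == 0
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))

    # Pick a random subset of patches to hide.
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)

    # Expand the patch grid to pixel resolution.
    return grid.repeat(patch, axis=0).repeat(patch, axis=1)


def mask_inputs(rgb, sparse_depth, mask):
    """Apply one mask to both modalities and mark the reconstruction targets."""
    masked_rgb = np.where(mask[..., None], 0.0, rgb)
    masked_depth = np.where(mask, 0.0, sparse_depth)
    # Assumption: depth == 0 means "no measurement", so only masked pixels
    # that carry a valid sparse depth sample are supervised.
    target = mask & (sparse_depth > 0)
    return masked_rgb, masked_depth, target


rng = np.random.default_rng(0)
rgb = rng.random((64, 128, 3))                               # toy panorama
depth = rng.random((64, 128)) * (rng.random((64, 128)) < 0.3)  # toy sparse depth
mask = shared_patch_mask(64, 128, patch=16, mask_ratio=0.75, rng=rng)
masked_rgb, masked_depth, target = mask_inputs(rgb, depth, mask)
```

Because the same `mask` is applied to both inputs, the network cannot recover a hidden depth patch by looking at the RGB pixels at the same location; it has to rely on the surrounding visible context in both modalities.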

Abstract (original)

In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task as panoramic 3D cameras often produce 360° depth with missing data in complex scenes. Its goal is to recover dense panoramic depths from raw sparse ones and panoramic RGB images. To deal with the PDC task, we train a deep network that takes both depth and image as inputs for the dense panoramic depth recovery. However, it needs to face a challenging optimization problem of the network parameters due to its non-convex objective function. To address this problem, we propose a simple yet effective approach termed M$^{3}$PT: multi-modal masked pre-training. Specifically, during pre-training, we simultaneously cover up patches of the panoramic RGB image and sparse depth by shared random mask, then reconstruct the sparse depth in the masked regions. To our best knowledge, it is the first time that we show the effectiveness of masked pre-training in a multi-modal vision task, instead of the single-modal task resolved by masked autoencoders (MAE). Different from MAE where fine-tuning completely discards the decoder part of pre-training, there is no architectural difference between the pre-training and fine-tuning stages in our M$^{3}$PT as they only differ in the prediction density, which potentially makes the transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M$^{3}$PT on three panoramic datasets. Notably, we improve the state-of-the-art baselines by averagely 26.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog on three benchmark datasets.
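The four error measures quoted in the final sentence are not defined in the abstract. The sketch below uses the definitions commonly seen in depth-completion work and should be read as an assumption rather than the paper's exact evaluation protocol; in particular, it takes MAE in that sentence to mean mean absolute error (not masked autoencoders), treats zero ground truth as invalid, and assumes predictions are strictly positive. The name `depth_metrics` is hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Compute RMSE, MRE, MAE and RMSElog over pixels with valid ground truth.

    Assumptions: gt == 0 marks pixels without ground truth, and pred > 0
    wherever gt is valid (needed for the logarithm).
    """
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    diff = p - g
    return {
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),
        "MRE": float(np.mean(np.abs(diff) / g)),   # mean relative error
        "MAE": float(np.mean(np.abs(diff))),       # mean absolute error
        "RMSElog": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
    }


gt = np.array([[2.0, 4.0], [0.0, 1.0]])    # one pixel has no ground truth
pred = np.array([[2.0, 2.0], [3.0, 1.0]])
metrics = depth_metrics(pred, gt)
```

All four are errors, so lower is better; the percentages in the abstract are reductions relative to the baselines.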

arXiv information

Authors	Zhiqiang Yan, Xiang Li, Kun Wang, Zhenyu Zhang, Jun Li, Jian Yang
Published	2022-07-06 10:15:34+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
