Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

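Summary

Popular computer vision backbones such as Vision Transformers (ViT) and ResNets are trained to perceive the world from 2D images.
To help 2D backbones capture 3D structural priors more effectively, we propose Mask3D, which leverages existing large-scale RGB-D data in self-supervised pre-training to embed these 3D priors into learned 2D feature representations.
In contrast to traditional 3D contrastive learning paradigms, which require 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames.
We demonstrate that Mask3D is particularly effective at embedding 3D priors into the powerful 2D ViT backbone, improving representation learning for a range of scene understanding tasks such as semantic segmentation, instance segmentation, and object detection.
Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, improving on the state-of-the-art Pri3D by +6.5% mIoU on ScanNet image semantic segmentation.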
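
The masking step behind the pretext task can be illustrated with a minimal sketch. This is not the authors' implementation: the patch size, the mask ratios, the choice to sample the RGB and depth masks independently, and the function names `sample_patch_mask`, `apply_patch_mask`, and `mask_rgbd_frame` are all assumptions made for illustration. Images are plain nested lists so the example needs only the standard library, and the reconstruction network and its loss are left out.

```python
import random


def sample_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch."""
    n_patches = grid_h * grid_w
    n_masked = round(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]


def apply_patch_mask(channel, mask, patch, fill=0.0):
    """Copy a 2D channel (list of rows), replacing masked patches with `fill`."""
    return [[fill if mask[y // patch][x // patch] else value
             for x, value in enumerate(row)]
            for y, row in enumerate(channel)]


def mask_rgbd_frame(rgb, depth, patch=16, rgb_ratio=0.5, depth_ratio=0.5, seed=0):
    """Mask RGB and depth patches of a single RGB-D frame.

    rgb is a list of three H x W channels and depth is one H x W channel.
    The default patch size and ratios are illustrative assumptions only.
    """
    height, width = len(depth), len(depth[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    rng = random.Random(seed)
    grid_h, grid_w = height // patch, width // patch
    # Assumption: the two modalities get independently sampled masks.
    rgb_mask = sample_patch_mask(grid_h, grid_w, rgb_ratio, rng)
    depth_mask = sample_patch_mask(grid_h, grid_w, depth_ratio, rng)
    masked_rgb = [apply_patch_mask(ch, rgb_mask, patch) for ch in rgb]
    masked_depth = apply_patch_mask(depth, depth_mask, patch)
    return masked_rgb, masked_depth, rgb_mask, depth_mask
```

In a pre-training loop of this kind, the masked frame would be fed to the backbone and the hidden content would serve as the reconstruction target; how the targets and losses are actually defined is specified in the paper, not here.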

Abstract (original)

Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets are trained to perceive the world from 2D images. However, to more effectively understand 3D structural priors in 2D backbones, we propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training to embed these 3D priors into 2D learned feature representations. In contrast to traditional 3D contrastive learning paradigms requiring 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pre-text reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate the Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image semantic segmentation.

arXiv information

Authors	Ji Hou, Xiaoliang Dai, Zijian He, Angela Dai, Matthias Nießner
Published	2023-02-28 16:45:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
