Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

Summary

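Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision.
However, existing MAE-style methods learn from a single modality only, either images or point clouds, and so ignore the implicit semantic and geometric correlations between 2D and 3D.
This paper examines how the 2D modality can help 3D masked autoencoding and proposes Joint-MAE, a joint 2D-3D MAE framework for self-supervised pre-training on 3D point clouds.
Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, then reconstructs the masked information of both modalities.
To improve cross-modal interaction, Joint-MAE is built from two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder that combines modality-shared and modality-specific decoders.
On top of this, two cross-modal strategies further strengthen 3D representation learning: a local-aligned attention mechanism for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints.
With this pre-training paradigm, Joint-MAE achieves strong performance on multiple downstream tasks, for example 92.4% accuracy with a linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.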
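
The masking step described above (project the point cloud to 2D, then randomly mask both modalities) can be illustrated with a small sketch. This is not the authors' implementation. The orthographic depth-map projection, the 8x8 image size, the 2x2 image patches, the 0.75 mask ratio, and the use of individual points as 3D tokens (real pipelines typically group points into local patches) are all assumptions made for illustration, and every function name here is hypothetical.

```python
import random

def project_to_depth_map(points, size=8):
    """Assumed projection: orthographic along z; each pixel keeps the largest z seen."""
    depth = [[0.0] * size for _ in range(size)]
    for x, y, z in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        depth[row][col] = max(depth[row][col], z)
    return depth

def image_patches(depth, patch=2):
    """Split a square depth map into non-overlapping patch x patch tiles (row-major)."""
    size = len(depth)
    tiles = []
    for r in range(0, size, patch):
        for c in range(0, size, patch):
            tiles.append([depth[r + i][c + j]
                          for i in range(patch) for j in range(patch)])
    return tiles

def random_mask(num_tokens, mask_ratio, rng):
    """MAE-style random masking: return sorted (visible, masked) index lists."""
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])

rng = random.Random(0)
# Toy point cloud: 256 random points in the unit cube.
points = [(rng.random(), rng.random(), rng.random()) for _ in range(256)]

depth = project_to_depth_map(points)   # 2D view of the same shape
tiles = image_patches(depth)           # 16 image tokens

# Mask each modality independently (mask ratio 0.75 is an assumed value).
vis_3d, mask_3d = random_mask(len(points), 0.75, rng)
vis_2d, mask_2d = random_mask(len(tiles), 0.75, rng)

print(len(vis_3d), len(mask_3d), len(vis_2d), len(mask_2d))  # 64 192 4 12
```

In an MAE-style setup, only the visible tokens of each modality would be passed to the encoder, and the decoder would be trained to reconstruct the masked ones. The sketch stops at the masking step because the summary does not specify the network or the losses in enough detail to reproduce them.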

Abstract (original)

Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from the data of a single modality, i.e., either images or point clouds, which neglect the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of the two modalities. For better cross-modal interaction, we construct our JointMAE by two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and model-specific decoders. On top of this, we further introduce two cross-modal strategies to boost the 3D representation learning, which are local-aligned attention mechanisms for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints. By our pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.

arXiv information

Authors	Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzhi Li, Pheng-Ann Heng
Published	2023-09-25 17:22:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
