Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

Abstract

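Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision.
However, existing MAE-style methods learn from a single modality only, either images or point clouds, and therefore neglect the implicit semantic and geometric correlations between 2D and 3D.
This paper explores how the 2D modality can benefit 3D masked autoencoding and proposes Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, then reconstructs the masked information of both modalities.
For better cross-modal interaction, Joint-MAE is built from two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and modal-specific decoders.
On top of this, two cross-modal strategies are introduced to boost 3D representation learning: a local-aligned attention mechanism for 2D-3D semantic cues and a cross-reconstruction loss for 2D-3D geometric constraints.
With this pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, for example 92.4% accuracy with a linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.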
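
The masking step described above (randomly mask a 3D point cloud and its projected 2D image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the orthographic depth projection, the per-point masking (rather than masking grouped point tokens), the 0.75 mask ratio, the 32x32 resolution, the 8-pixel patch size, and all function names are assumptions chosen only for illustration.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array in which True marks a masked token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def project_to_depth_map(points, resolution):
    """Orthographically project (N, 3) points along z onto a square depth map.

    Hypothetical stand-in for the paper's 2D projection: each pixel keeps the
    largest shifted z value of the points that fall into it; empty pixels are 0.
    """
    xy = points[:, :2]
    mins = xy.min(axis=0)
    span = np.maximum(xy.max(axis=0) - mins, 1e-8)
    pix = np.minimum(((xy - mins) / span * resolution).astype(int), resolution - 1)
    z = points[:, 2] - points[:, 2].min()  # shift so all depths are >= 0
    depth = np.zeros((resolution, resolution), dtype=np.float64)
    np.maximum.at(depth, (pix[:, 1], pix[:, 0]), z)
    return depth


def mask_image_patches(image, patch, mask_ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch blocks."""
    h, w = image.shape
    grid_h, grid_w = h // patch, w // patch
    patch_mask = random_mask(grid_h * grid_w, mask_ratio, rng).reshape(grid_h, grid_w)
    pixel_mask = np.repeat(np.repeat(patch_mask, patch, axis=0), patch, axis=1)
    masked = image.copy()
    masked[pixel_mask] = 0.0
    return masked, patch_mask


rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))  # synthetic stand-in for a point cloud

# 3D branch: hide 75% of the points; the encoder would see only the rest.
point_mask = random_mask(len(points), 0.75, rng)
visible_points = points[~point_mask]

# 2D branch: project the full cloud, then hide 75% of the image patches.
depth = project_to_depth_map(points, 32)
masked_depth, patch_mask = mask_image_patches(depth, 8, 0.75, rng)
```

In a full pipeline the visible points and visible patches would be fed to the joint encoder, and the decoders would be trained to reconstruct the masked parts of both modalities. Those components are not sketched here because the abstract does not specify them.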

Abstract (original)

Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from the data of a single modality, i.e., either images or point clouds, which neglect the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of the two modalities. For better cross-modal interaction, we construct our JointMAE by two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and model-specific decoders. On top of this, we further introduce two cross-modal strategies to boost the 3D representation learning, which are local-aligned attention mechanisms for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints. By our pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.

arXiv information

Authors	Ziyu Guo, Xianzhi Li, Pheng Ann Heng
Published	2023-02-27 17:56:18+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
