Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation

Abstract

Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at https://github.com/ZhaochongAn/Multimodality-3D-Few-Shot
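The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: correlating query points with support prototypes in two visual feature spaces, fusing the correlations, and adding text-based guidance. It is not the MCF, MSF, or TACC modules from the paper. The function names, the fixed weights `alpha` and `gamma`, and the assumption that intermodal features share an embedding space with the text embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def class_prototypes(feats, labels, num_classes):
    """Mean support feature per class (every class must appear in `labels`)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])


def cosine_correlation(query_feats, references):
    """Cosine similarity between each query point and each reference vector."""
    return l2_normalize(query_feats) @ l2_normalize(references).T


def segment_query(query_uni, query_inter, support_uni, support_inter,
                  support_labels, text_emb, alpha=0.5, gamma=0.5):
    """Label each query point using fused visual correlations plus text guidance.

    Assumption (not from the source): intermodal features and text embeddings
    have the same dimensionality and are directly comparable.
    """
    num_classes = text_emb.shape[0]
    # Correlations against support prototypes, one set per visual head.
    corr_uni = cosine_correlation(
        query_uni, class_prototypes(support_uni, support_labels, num_classes))
    corr_inter = cosine_correlation(
        query_inter, class_prototypes(support_inter, support_labels, num_classes))
    # Simple weighted late fusion of the two correlation maps (hypothetical).
    fused = alpha * corr_uni + (1.0 - alpha) * corr_inter
    # Text guidance: similarity of query points to class-name embeddings.
    text_guidance = cosine_correlation(query_inter, text_emb)
    scores = fused + gamma * text_guidance
    return scores.argmax(axis=1)
```

Here `query_uni` and `support_uni` stand for per-point features from the unimodal head, `query_inter` and `support_inter` for features from the intermodal head, and `text_emb` for one embedding per class name. A learned fusion module would replace the fixed weights in practice.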

arXiv information

Authors	Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, Serge Belongie
Published	2025-02-26 12:33:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
