Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

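Summary

We address open-vocabulary 3D scene understanding by introducing a new data generation pipeline and training framework. The method targets three requirements for effective training: accurate 3D region segmentation, comprehensive text descriptions, and sufficient dataset scale. Using state-of-the-art open-vocabulary image segmentation models and region-aware vision-language models, we build an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets yields Mosaic3D-5.6M, a dataset of more than 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. On top of this data we propose Mosaic3D, a foundation model that combines a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. The approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks, including ScanNet200, Matterport3D, and ScanNet++, and ablation studies validate the effectiveness of the large-scale training data.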

Summary (original)

We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
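The abstract states that the 3D encoder is trained with contrastive learning on mask-text pairs but does not give the loss. As an illustration only, the sketch below assumes a CLIP-style InfoNCE objective in the region-to-text direction: per-point features are mean-pooled inside each 3D mask and pulled toward the embedding of the paired caption. The function names (`mean_pool`, `contrastive_loss`), the pooling choice, and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mean_pool(point_features, mask):
    """Average the features of the points selected by a boolean mask."""
    selected = [f for f, keep in zip(point_features, mask) if keep]
    if not selected:
        raise ValueError("mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Assumed InfoNCE loss: region i should match caption i over all others."""
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for i, r in enumerate(regions):
        # Cosine similarity to every caption, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(regions)


# Toy scene: four points with 2-D features, two masks, two caption embeddings.
point_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
masks = [[True, True, False, False], [False, False, True, True]]
captions = [[1.0, 0.0], [0.0, 1.0]]

regions = [mean_pool(point_features, m) for m in masks]
aligned = contrastive_loss(regions, captions)
swapped = contrastive_loss(regions, captions[::-1])
print(aligned < swapped)
```

In this toy setup the loss is lower when each mask is paired with its matching caption than when the captions are swapped, which is the behaviour a contrastive objective is meant to reward. A real system would compute the same quantity on batched tensors from a learned 3D encoder and a text encoder.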

arXiv information

Authors	Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy
Published	2025-02-04 18:18:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
