3D Vision-Language Gaussian Splatting

Abstract

Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a balance between visual and language modalities, which leads to unsatisfying semantic rasterization of translucent or reflective objects, as well as over-fitting on color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision-language Gaussian splatting model for scene understanding, to put emphasis on the representation learning of language modality. We propose a novel cross-modal rasterizer, using modality fusion along with a smoothed semantic indicator for enhancing semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.
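
The abstract does not give the rasterizer's equations, so the following is only a toy sketch of why a separate semantic weight might help with translucent objects, not the authors' implementation. It assumes standard front-to-back alpha compositing along a single ray and a hypothetical per-Gaussian `semantic_indicator` that replaces color opacity when blending language features. All names and numbers are illustrative assumptions.

```python
from typing import List, Sequence


def composite(values: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Front-to-back alpha compositing along one ray.

    values[i]  : attribute vector of the i-th Gaussian (sorted near to far)
    weights[i] : its blending weight in [0, 1] at this pixel
    """
    out = [0.0] * len(values[0])
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for value, w in zip(values, weights):
        contribution = transmittance * w
        for k, v in enumerate(value):
            out[k] += contribution * v
        transmittance *= 1.0 - w
    return out


# Hypothetical ray: a nearly transparent pane of glass in front of a wall.
colors = [[0.9, 0.9, 1.0], [0.5, 0.3, 0.2]]  # RGB per Gaussian
semantics = [[1.0, 0.0], [0.0, 1.0]]         # toy 2-D features: "glass", "wall"
opacity = [0.1, 0.9]                         # color opacity (glass is see-through)
semantic_indicator = [0.8, 0.9]              # assumed separate weight for semantics

pixel_color = composite(colors, opacity)
sem_shared = composite(semantics, opacity)               # semantics reuse color opacity
sem_separate = composite(semantics, semantic_indicator)  # semantics use their own weight
```

When the semantic features reuse the color opacity, the pixel's feature is dominated by the wall behind the glass. With a separate weight, the glass dominates the semantic output while the rendered color still shows the wall through it.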

arXiv information

Authors	Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu
Published	2025-05-05 17:00:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
