Unifying 2D and 3D Vision-Language Understanding

Summary

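Progress in 3D vision-language learning has been held back by the scarcity of large-scale 3D datasets.
We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data.
We propose a novel language-conditioned mask decoder, shared across the 2D and 3D modalities, that grounds objects effectively in both RGB and RGB-D images and outperforms box-based approaches.
To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, which let UniVLG use 2D data to improve 3D performance.
With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, showing the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain.
Furthermore, co-training on both 2D and 3D data improves performance across modalities without sacrificing 2D capabilities.
By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation.
Code and additional visualizations are available at https://univlg.github.io.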

Abstract (original)

Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation. Code and additional visualizations are available at https://univlg.github.io .
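
The abstract mentions 2D-to-3D lifting but does not say how it is done. As a rough illustration only, one common way to lift pixels with known depth into 3D is pinhole-camera unprojection. The sketch below assumes that model; the function name `unproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the UniVLG code.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift each pixel (u, v) with depth z to a camera-frame 3D point.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy). This is an illustrative sketch, not the paper's method.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Non-positive depth is treated as a missing measurement.
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map (metres); one pixel has no valid depth.
depth_map = [[2.0, 2.0], [0.0, 4.0]]
points = unproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points)
```

For an RGB-D frame the depth comes from the sensor; for plain RGB images a depth estimate would have to be supplied by some other model, which is an assumption outside what the abstract states.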

arXiv information

Authors	Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Sergio Arnaud, Ada Martin, Alexander Sax, Franziska Meier, Katerina Fragkiadaki
Published	2025-03-20 16:24:10+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
