3D Vision and Language Pretraining with Large-Scale Synthetic Data

Summary

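3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes and natural language, an important technique for embodied intelligence.
However, current 3D-VLP datasets are held back by limited scene-level diversity and insufficient fine-grained annotations (ScanScribe has only 1.2K scenes and 280K textual annotations), mainly because collecting and annotating 3D scenes is labor-intensive.
To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels. Its advantages are diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost.
Using the rich annotations in SynVL3D, we pre-train a simple, unified Transformer that aligns 3D and language through multi-grained pre-training tasks.
Moreover, to address the domain shift, we propose synthetic-to-real domain adaptation during downstream task fine-tuning.
Through extensive experiments, we verify the effectiveness of the model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.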

Summary (Original)

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-train model which can bridge 3D scenes with natural language, which is an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive of collecting and annotating 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels, which has the advantages of diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pretraining tasks. Moreover, we propose a synthetic-to-real domain adaptation in downstream task fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.

arXiv Information

Authors	Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, Yang Liu
Published	2024-07-08 16:26:52+00:00
arXiv site	arxiv_id(pdf)

Source and Services Used

arxiv.jp, Google
