EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning

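Summary

3D visual grounding aims to locate objects in point clouds that are referred to by free-form natural language descriptions with rich semantic components.
However, existing methods either extract sentence-level features that couple all words together, losing word-level information, or focus mainly on object names, neglecting other attributes.
To alleviate this issue, we present EDA, which explicitly decouples the textual attributes in a sentence and performs dense alignment between this fine-grained language and point cloud objects.
Specifically, we first propose a text decoupling module that produces textual features for every semantic component.
We then design two losses to supervise the dense matching between the two modalities: textual position alignment and object semantic alignment.
On top of that, we introduce two new visual grounding tasks, locating objects without object names and locating auxiliary objects referenced in the descriptions, both of which thoroughly evaluate the model's dense alignment capacity.
In experiments, we achieve state-of-the-art performance on two widely adopted visual grounding datasets, ScanRefer and SR3D/NR3D, and take a clear lead on the two newly proposed tasks.
The code is available at https://github.com/yanmin-wu/EDA.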

Summary (original)

3D visual grounding aims to find the objects within point clouds mentioned by free-form natural language descriptions with rich semantic components. However, existing methods either extract the sentence-level features coupling all words, or focus more on object names, which would lose the word-level information or neglect other attributes. To alleviate this issue, we present EDA that Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between two modalities: the textual position alignment and object semantic alignment. On top of that, we further introduce two new visual grounding tasks, locating objects without object names and locating auxiliary objects referenced in the descriptions, both of which can thoroughly evaluate the model’s dense alignment capacity. Through experiments, we achieve state-of-the-art performance on two widely-adopted visual grounding datasets, ScanRefer and SR3D/NR3D, and obtain absolute leadership on our two newly-proposed tasks. The code will be available at https://github.com/yanmin-wu/EDA.
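
To make the idea of decoupled text components and position alignment concrete, here is a minimal Python sketch. It is not the authors' implementation: the component names (`main_object`, `attribute`, `relation`, `auxiliary_object`), the hand-written token spans, the example logits, and the use of a plain cross-entropy as the alignment loss are all assumptions made for illustration. The abstract only states that text is decoupled into semantic components and that a textual position alignment loss supervises the matching.

```python
import math

def position_label(num_tokens, span_indices):
    """Uniform target distribution over the token positions of one text component."""
    if not span_indices:
        raise ValueError("component has no tokens")
    weight = 1.0 / len(span_indices)
    return [weight if i in span_indices else 0.0 for i in range(num_tokens)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def position_alignment_loss(logits, target):
    """Cross-entropy between predicted token probabilities and a target distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

tokens = "the brown chair next to the table".split()

# Hypothetical hand-made decoupling of the sentence into components
# (token indices); a real system would derive these from a parser.
components = {
    "main_object": {2},       # "chair"
    "attribute": {1},         # "brown"
    "relation": {3, 4},       # "next to"
    "auxiliary_object": {6},  # "table"
}
targets = {name: position_label(len(tokens), idx) for name, idx in components.items()}

# Hypothetical per-token logits predicted for one candidate object.
object_logits = [0.0, 2.0, 3.0, 0.5, 0.5, 0.0, 0.1]

# Sum the loss over the components that describe the target object itself.
loss = sum(position_alignment_loss(object_logits, targets[name])
           for name in ("main_object", "attribute"))
```

Under these assumptions, a candidate whose logits concentrate on "brown" and "chair" receives a lower loss than one that attends to unrelated tokens, which is the kind of word-level supervision a sentence-level feature cannot provide.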

arXiv information

Authors	Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang
Published	2022-09-29 17:00:22+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
