Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

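Summary

3D visual grounding is the task of localizing the object in a 3D scene that a natural-language description refers to. The task has recently grown in popularity, with applications ranging from autonomous indoor robotics to AR/VR. A common formulation is grounding-by-detection, in which localization is performed via bounding boxes. For real-world applications that require physical interaction, however, a bounding box does not adequately describe the geometry of an object. We therefore address dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose ConcreteNet, a dense 3D grounding network with four novel stand-alone modules that aim to improve grounding performance on challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues. Second, we construct a contrastive training scheme to induce separation in the latent space. Third, we resolve view-dependent utterances via a learned global camera token. Finally, we employ multi-view ensembling to improve the quality of the referred mask. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and won the ICCV 3rd Workshop on Language for 3D Scenes '3D Object Localization' challenge.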

Summary (original)

3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues, next, we construct a contrastive training scheme to induce separation in the latent space, we then resolve view-dependent utterances via a learned global camera token, and finally we employ multi-view ensembling to improve referred mask quality. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and has won the ICCV 3rd Workshop on Language for 3D Scenes ‘3D Object Localization’ challenge.
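
The abstract does not say how predictions from different views are combined in the multi-view ensembling step. As a rough illustration only, the sketch below assumes the simplest possible scheme: each view yields a binary per-point mask for the referred object, and the masks are fused by majority vote. The function name `ensemble_masks`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ensemble_masks(view_masks: Sequence[Sequence[int]],
                   threshold: float = 0.5) -> List[int]:
    """Fuse per-point binary masks from several views by voting.

    ASSUMPTION: majority voting is an illustrative stand-in, not the
    ensembling rule described by the authors.
    """
    if not view_masks:
        raise ValueError("need at least one view")
    num_points = len(view_masks[0])
    if any(len(mask) != num_points for mask in view_masks):
        raise ValueError("all masks must cover the same points")

    fused = []
    # zip(*view_masks) groups the votes that every view cast for one point.
    for point_votes in zip(*view_masks):
        share = sum(point_votes) / len(view_masks)
        fused.append(1 if share > threshold else 0)
    return fused


# Toy example: three views, five points (hypothetical data).
masks = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
fused = ensemble_masks(masks)
print(fused)  # [1, 1, 0, 0, 1]
```

A point is kept only when more than half of the views agree on it, so a spurious prediction from a single view (the fourth point above) is dropped.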

arXiv information

Authors	Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool
Published	2024-07-03 14:01:52+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
