Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

Summary

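This paper proposes an efficient multi-level convolutional architecture for 3D visual grounding.
Conventional methods struggle to meet real-time inference requirements because they rely on two-stage or point-based architectures.
Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework along this technical route.
However, 3D visual grounding requires the 3D scene representation to interact deeply with text features, and the large number of voxel features makes sparse convolution-based architectures inefficient for this interaction.
To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA), which fuse the 3D scene representation and text features deeply and efficiently through gradual region pruning and target completion.
Specifically, TGP iteratively sparsifies the 3D scene representation, so that the voxel features can interact efficiently with the text features through cross-attention.
To mitigate the effect of pruning on delicate geometric information, CBA adaptively repairs over-pruned regions through voxel completion with negligible computational overhead.
Compared with previous single-stage methods, our method achieves the highest inference speed, surpassing the previous fastest method by 100\% in FPS.
Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a lead of $+1.13$ in Acc@0.5 on ScanRefer and leads of $+2.6$ and $+3.2$ on NR3D and SR3D, respectively.
The code is available at \href{https://github.com/GWxuan/TSP3D}{https://github.com/GWxuan/TSP3D}.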
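
The pruning idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head cross-attention without learned projections, a similarity-based relevance score in place of a learned pruning head, and a fixed keep ratio per level. The function names `cross_attend` and `text_guided_prune` and the parameter `keep_ratio` are hypothetical, chosen only for illustration, and the order of attention and pruning is likewise an assumption.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(voxel_feats, text_feats):
    """Each voxel feature (query) attends over the text tokens (keys/values).

    Assumption: no learned projections; a residual connection adds the
    attended text context back onto the voxel feature.
    """
    scale = math.sqrt(len(text_feats[0]))
    fused = []
    for v in voxel_feats:
        weights = softmax([dot(v, t) / scale for t in text_feats])
        context = [
            sum(w * t[d] for w, t in zip(weights, text_feats))
            for d in range(len(v))
        ]
        fused.append([a + b for a, b in zip(v, context)])
    return fused


def text_guided_prune(coords, feats, text_feats, keep_ratio=0.5):
    """Fuse voxels with text, then keep only the most text-relevant voxels.

    Assumption: relevance is the maximum dot product with any text token,
    a stand-in for whatever learned scoring the real method uses.
    """
    fused = cross_attend(feats, text_feats)
    scores = [max(dot(f, t) for t in text_feats) for f in fused]
    k = max(1, int(len(coords) * keep_ratio))
    keep = sorted(range(len(coords)), key=lambda i: scores[i], reverse=True)[:k]
    keep.sort()  # preserve the original voxel order among survivors
    return [coords[i] for i in keep], [fused[i] for i in keep]


if __name__ == "__main__":
    random.seed(0)
    # Toy scene: 16 occupied voxels with 8-dim features, 3 text tokens.
    coords = [(x, y, 0) for x in range(4) for y in range(4)]
    feats = [[random.gauss(0, 1) for _ in range(8)] for _ in coords]
    text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
    for level in range(2):
        coords, feats = text_guided_prune(coords, feats, text, keep_ratio=0.5)
        print(f"level {level}: {len(coords)} voxels remain")
```

Because the number of voxels shrinks at every level, the cost of each later cross-attention step shrinks with it, which is the efficiency argument the summary makes. The completion step (CBA) is not sketched here, since the summary does not describe its mechanism in enough detail.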

Abstract (original)

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods are difficult to meet the requirements of real-time inference due to the two-stage or point-based architecture. Inspired by the success of multi-level fully sparse convolutional architecture in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, as in 3D visual grounding task the 3D scene representation should be deeply interacted with text features, sparse convolution-based architecture is inefficient for this interaction due to the large amount of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse 3D scene representation and text features in an efficient way by gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features by cross-attention. To mitigate the affect of pruning on delicate geometric information, CBA adaptively fixes the over-pruned region by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses previous fastest method by 100\% FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with $+1.13$ lead of Acc@0.5 on ScanRefer, and $+2.6$ and $+3.2$ leads on NR3D and SR3D respectively. The code is available at \href{https://github.com/GWxuan/TSP3D}{https://github.com/GWxuan/TSP3D}.

arXiv information

Authors	Wenxuan Guo, Xiuwei Xu, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Published	2025-02-14 18:59:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
