FusionFormer: A Multi-sensory Fusion in Bird’s-Eye-View and Temporal Consistent Transformer for 3D Object Detection

Summary

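Multi-sensor fusion has shown clear advantages in 3D object detection.
However, existing methods for fusing multi-modal features first transform the features into bird's-eye-view (BEV) space, which can discard information along the Z-axis and degrade performance.
To address this, we propose FusionFormer, a novel end-to-end multi-modal fusion framework based on transformers that incorporates deformable attention and residual structures in its fusion encoding module.
Specifically, a uniform sampling strategy lets our method sample directly from both 2D image features and 3D voxel features, which keeps it flexible and avoids an explicit transformation to BEV space during feature concatenation.
We also implement a residual structure in the feature encoder so that the model remains robust when an input modality is missing.
In extensive experiments on nuScenes, a popular autonomous driving benchmark, our method achieves state-of-the-art single-model performance of 72.6% mAP and 75.1% NDS on 3D object detection without test-time augmentation.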
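
The sampling and residual ideas in the summary can be illustrated with a small NumPy sketch. It is not the authors' implementation: it replaces deformable attention (learned offsets and weights, bilinear sampling) with a nearest-neighbour lookup at fixed reference points and a plain sum. The function names (`sample_voxel`, `sample_image`, `fuse`), the parameters `pc_min`, `voxel_size`, and `K`, and the assumption that LiDAR and camera share one coordinate frame are all hypothetical, chosen only to keep the example short.

```python
import numpy as np

def sample_voxel(voxels, pts, pc_min, voxel_size):
    """Nearest-neighbour lookup of 3D points in an (X, Y, Z, C) voxel grid."""
    idx = np.floor((pts - pc_min) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(voxels.shape[:3]) - 1)  # clamp to the grid
    return voxels[idx[:, 0], idx[:, 1], idx[:, 2]]

def sample_image(image, pts, K):
    """Project 3D points with pinhole intrinsics K, then read the nearest
    pixel of an (H, W, C) feature map. Points that fall outside the image
    or behind the camera contribute zeros."""
    uvw = pts @ K.T                          # (N, 3) homogeneous pixel coords
    depth = uvw[:, 2]
    valid = depth > 1e-6
    uv = np.zeros((len(pts), 2))
    uv[valid] = uvw[valid, :2] / depth[valid, None]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w = image.shape[:2]
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros((len(pts), image.shape[2]))
    out[valid] = image[v[valid], u[valid]]
    return out

def fuse(queries, pts, voxels=None, image=None, *, pc_min, voxel_size, K):
    """Residual update of the queries from whichever modalities are present.
    No modality is converted to a BEV feature map first; a missing modality
    simply skips its branch and the queries pass through unchanged."""
    out = queries.copy()
    if voxels is not None:
        out = out + sample_voxel(voxels, pts, pc_min, voxel_size)
    if image is not None:
        out = out + sample_image(image, pts, K)
    return out

# Toy inputs (assumed values, for illustration only).
queries = np.zeros((1, 3))
pts = np.array([[1.0, 1.0, 2.0]])
voxels = np.ones((4, 4, 4, 3))
image = np.full((8, 8, 3), 2.0)
K = np.array([[2.0, 0.0, 4.0], [0.0, 2.0, 4.0], [0.0, 0.0, 1.0]])
cfg = dict(pc_min=np.zeros(3), voxel_size=1.0, K=K)

fused_both = fuse(queries, pts, voxels, image, **cfg)
fused_lidar_only = fuse(queries, pts, voxels, None, **cfg)
fused_none = fuse(queries, pts, None, None, **cfg)
```

Because each modality enters as an additive residual, dropping the camera or the LiDAR input changes the size of the update but never leaves the queries undefined, which is the robustness property the summary describes.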

Abstract (original)

Multi-sensor modal fusion has demonstrated strong advantages in 3D object detection tasks. However, existing methods that fuse multi-modal features require transforming features into the bird’s eye view space and may lose certain information on Z-axis, thus leading to inferior performance. To this end, we propose a novel end-to-end multi-modal fusion transformer-based framework, dubbed FusionFormer, that incorporates deformable attention and residual structures within the fusion encoding module. Specifically, by developing a uniform sampling strategy, our method can easily sample from 2D image and 3D voxel features spontaneously, thus exploiting flexible adaptability and avoiding explicit transformation to the bird’s eye view space during the feature concatenation process. We further implement a residual structure in our feature encoder to ensure the model’s robustness in case of missing an input modality. Through extensive experiments on a popular autonomous driving benchmark dataset, nuScenes, our method achieves state-of-the-art single model performance of 72.6% mAP and 75.1% NDS in the 3D object detection task without test time augmentation.

arXiv information

Authors	Chunyong Hu, Hang Zheng, Kun Li, Jianyun Xu, Weibo Mao, Maochun Luo, Lingxuan Wang, Mingxia Chen, Qihao Peng, Kaixuan Liu, Yiru Zhao, Peihan Hao, Minzhe Liu, Kaicheng Yu
Published	2023-10-06 09:46:05+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
