CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection

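Summary

Many multi-camera methods have been proposed to achieve accurate, low-cost 3D object detection for autonomous driving, and they resolve the occlusion problem of monocular approaches. However, because existing multi-camera methods lack accurate depth estimates, they often generate multiple bounding boxes along a ray in the depth direction for difficult small objects such as pedestrians, which results in extremely low recall. Moreover, existing multi-camera methods are generally built on large network architectures, so directly adding a depth prediction module to them cannot meet the real-time requirements of self-driving applications. To address these issues, we propose CrossDTR (Cross-view and Depth-guided Transformers for 3D Object Detection). First, a lightweight depth predictor is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings, without requiring extra depth datasets for supervision. Second, a cross-view depth-guided transformer fuses the depth embeddings with image features from cameras of different views and generates 3D bounding boxes. Extensive experiments show that the method surpasses existing multi-camera methods by 10% in pedestrian detection and by about 3% in the overall mAP and NDS metrics. Computational analysis also shows that it is 5 times faster than prior approaches. The code will be made publicly available at https://github.com/sty61010/CrossDTR.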

Abstract (original)

To achieve accurate 3D object detection at a low cost for autonomous driving, many multi-camera methods have been proposed and solved the occlusion problem of monocular approaches. However, due to the lack of accurate estimated depth, existing multi-camera methods often generate multiple bounding boxes along a ray of depth direction for difficult small objects such as pedestrians, resulting in an extremely low recall. Furthermore, directly applying depth prediction modules to existing multi-camera methods, generally composed of large network architectures, cannot meet the real-time requirements of self-driving applications. To address these issues, we propose Cross-view and Depth-guided Transformers for 3D Object Detection, CrossDTR. First, our lightweight depth predictor is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings without extra depth datasets during supervision. Second, a cross-view depth-guided transformer is developed to fuse the depth embeddings as well as image features from cameras of different views and generate 3D bounding boxes. Extensive experiments demonstrated that our method hugely surpassed existing multi-camera methods by 10 percent in pedestrian detection and about 3 percent in overall mAP and NDS metrics. Also, computational analyses showed that our method is 5 times faster than prior approaches. Our codes will be made publicly available at https://github.com/sty61010/CrossDTR.
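The depth ambiguity described in the abstract can be illustrated with a minimal pinhole-camera sketch. The intrinsics, pixel location, and candidate depths below are made-up values, and the helpers `back_project` and `project` are hypothetical names chosen for illustration; this is not code from the CrossDTR repository. It only shows that 3D centres at different depths along one viewing ray land on the same pixel, which is why a detector without a reliable depth estimate can emit several boxes for one object.

```python
# Hypothetical pinhole intrinsics (focal lengths and principal point, in pixels).
FX, FY, CX, CY = 1000.0, 1000.0, 800.0, 450.0


def back_project(u, v, depth):
    """Lift pixel (u, v) to a 3D point in the camera frame at the given depth (metres)."""
    x = (u - CX) / FX * depth
    y = (v - CY) / FY * depth
    return (x, y, depth)


def project(point):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (FX * x / z + CX, FY * y / z + CY)


# Image location of one small object, e.g. a pedestrian (made-up value).
pixel = (1200.0, 500.0)

# Candidate depths along the viewing ray (assumed values). Without a depth
# estimate, each candidate is equally consistent with the image evidence.
candidates = [back_project(pixel[0], pixel[1], d) for d in (10.0, 20.0, 40.0)]

for point in candidates:
    u, v = project(point)
    print(f"3D centre {point} -> pixel ({u:.1f}, {v:.1f})")
```

All three candidate centres reproject to the same pixel, so image evidence alone cannot choose among them. A depth cue, such as the object-wise depth maps the abstract mentions, is one way to narrow that choice.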

arXiv information

Authors	Ching-Yu Tseng, Yi-Rong Chen, Hsin-Ying Lee, Tsung-Han Wu, Wen-Chin Chen, Winston H. Hsu
Published	2023-02-03 10:39:37+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
