Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection

Summary

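Recent camera-only 3D detection methods use multiple timesteps, but the limited history they draw on is a major obstacle to improving object perception through temporal fusion. Observing that the multi-frame image fusion in existing work is an instance of temporal stereo matching, we find that performance is held back by the interplay between 1) the coarse granularity of the matching resolution and 2) the sub-optimal multi-view setup that results from limited history usage. Our theoretical and empirical analysis shows that the optimal temporal difference between views varies greatly across pixels and depths, so many timesteps must be fused over a long history. Based on this investigation, we propose generating a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. We further augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching, and find that long-term and short-term temporal fusion are highly complementary. While maintaining high efficiency, the framework sets a new state of the art on nuScenes, taking first place on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set. Code will be released at https://github.com/Divadi/SOLOFusion.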

Summary (original)

While recent camera-only 3D detection methods leverage multiple timesteps, the limited history they use significantly hampers the extent to which temporal fusion can improve object perception. Observing that existing works’ fusion of multi-frame images are instances of temporal stereo matching, we find that performance is hindered by the interplay between 1) the low granularity of matching resolution and 2) the sub-optimal multi-view setup produced by limited history usage. Our theoretical and empirical analysis demonstrates that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many timesteps over long-term history. Building on our investigation, we propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. Further, we augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching and find that long and short term temporal fusion are highly complementary. While maintaining high efficiency, our framework sets new state-of-the-art on nuScenes, achieving first place on the test set and outperforming previous best art by 5.2% mAP and 3.7% NDS on the validation set. Code will be released at https://github.com/Divadi/SOLOFusion.
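
The claim that the useful temporal gap between views depends on pixel and depth can be illustrated with a toy calculation. The sketch below is not the paper's analysis or code. It assumes a pinhole camera moving straight forward past static points, and uses a simplified proxy: the smallest frame gap at which a point's image shift reaches one coarse feature cell. The function names, focal length, per-frame travel distance, and 16-pixel cell size are all hypothetical values chosen for illustration.

```python
def pixel_shift(focal_px, lateral_m, depth_m, forward_m):
    """Horizontal image shift (pixels) of a static point after the camera
    moves forward by forward_m. Assumes an ideal pinhole camera."""
    if forward_m >= depth_m:
        raise ValueError("camera would reach or pass the point")
    before = focal_px * lateral_m / depth_m
    after = focal_px * lateral_m / (depth_m - forward_m)
    return after - before


def min_frame_gap(focal_px, lateral_m, depth_m, step_m, min_shift_px,
                  max_gap=100):
    """Smallest number of frames between two views such that the point's
    image shift is at least min_shift_px. Returns None if no gap up to
    max_gap works, or if the camera would pass the point first."""
    for gap in range(1, max_gap + 1):
        forward_m = step_m * gap
        if forward_m >= depth_m:
            return None
        shift = pixel_shift(focal_px, lateral_m, depth_m, forward_m)
        if abs(shift) >= min_shift_px:
            return gap
    return None


if __name__ == "__main__":
    # Hypothetical setup: 1000 px focal length, 5 m travelled per frame,
    # and a coarse feature cell of 16 px.
    cases = [(2.0, 20.0), (2.0, 40.0), (0.5, 40.0)]
    for lateral_m, depth_m in cases:
        gap = min_frame_gap(1000.0, lateral_m, depth_m, 5.0, 16.0)
        print(f"lateral={lateral_m} m, depth={depth_m} m -> gap={gap}")
```

Under these assumptions, a nearby off-axis point needs only a one-frame gap, while a distant point near the image center needs several frames before its motion is visible at coarse resolution. This is consistent with the abstract's argument for fusing many timesteps rather than a fixed short window.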

arXiv information

Authors	Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, Wei Zhan
Published	2022-10-05 17:59:51+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
