ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

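Summary

3D occupancy is an advanced perception technology for driving scenarios that represents the entire scene, without distinguishing foreground from background, by quantizing physical space into a grid map.
Projection-first deformable attention, which is efficient and widely adopted for transforming image features into 3D representations, faces challenges when aggregating multi-view features because of sensor deployment constraints.
To address this problem, we propose a learning-first view attention mechanism for effective multi-view feature aggregation.
We also demonstrate the scalability of view attention across a range of multi-view 3D tasks, such as map construction and 3D object detection.
Building on the proposed view attention and an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric, transformer-based framework for spatiotemporal feature aggregation.
To further investigate occupancy-level flow representation, we introduce FlowOcc3D, a benchmark built on top of existing high-quality datasets.
Qualitative and quantitative analyses on this benchmark reveal its potential to represent fine-grained dynamic scenes.
Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods.
The code and benchmark will be released soon.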

Abstract (original)

3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose our learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, such as map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The codes and benchmark will be released soon.
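The abstract describes occupancy as quantifying physical space into a grid map in which every occupied cell carries a label, whether it belongs to foreground or background. The following is a minimal sketch of that idea only, not the ViewFormer method or its data format: the function name `voxelize`, the grid origin, voxel size, grid shape, and class labels are all hypothetical values chosen for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

Voxel = Tuple[int, int, int]
Point = Tuple[float, float, float]


def voxelize(
    points: Sequence[Point],
    labels: Sequence[int],
    origin: Point,
    voxel_size: float,
    grid_shape: Voxel,
) -> Dict[Voxel, int]:
    """Assign labelled 3D points to cells of a regular grid.

    Hypothetical helper: returns a sparse map from voxel index to class
    label. Points outside the grid are dropped, and a later point
    overwrites an earlier one that falls in the same voxel.
    """
    grid: Dict[Voxel, int] = {}
    for point, label in zip(points, labels):
        # Integer cell index along each axis, measured from the grid origin.
        idx = tuple(
            math.floor((c - o) / voxel_size) for c, o in zip(point, origin)
        )
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            grid[idx] = label
    return grid


# Assumed toy scene: label 0 = "road" (background), label 1 = "car" (foreground).
example = voxelize(
    points=[(0.1, 0.2, 0.3), (-0.9, 0.9, 0.6), (5.0, 0.0, 0.0)],
    labels=[0, 1, 1],
    origin=(-1.0, -1.0, 0.0),
    voxel_size=0.5,
    grid_shape=(4, 4, 2),
)
```

Background and foreground points go through the same code path, which is the sense in which the grid represents the whole scene uniformly; the third point lies outside the assumed grid extent and is discarded.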

arXiv information

Authors	Jinke Li, Xiao He, Chonghua Zhou, Xiaoqiang Cheng, Yang Wen, Dan Zhang
Published	2024-05-07 13:15:07+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
