BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment

Abstract

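In autonomous driving and mobile robotics, the methods used to create Bird's Eye View (BEV) representations have shifted significantly.
This shift is characterised by the use of transformers that learn to fuse measurements from disparate vision sensors, mainly lidar and cameras, into a ground-based 2D planar representation.
However, learning-based methods for creating such maps often rely heavily on extensive annotated data, which poses notable challenges, particularly in diverse or non-urban environments where large-scale datasets are scarce.
This work presents BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal.
This method notably reduces the dependence on costly annotated data.
By leveraging pose information, the framework aligns and fuses multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment.
The pretraining approach shows promising performance on BEV map segmentation tasks, outperforming fully supervised state-of-the-art methods while requiring only a minimal amount of annotated data.
This development not only addresses the challenge of data efficiency in BEV representation learning but also broadens the potential of such techniques in a variety of domains, including off-road and indoor environments.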

Abstract (original)

In the field of autonomous driving and mobile robotics, there has been a significant shift in the methods used to create Bird’s Eye View (BEV) representations. This shift is characterised by using transformers and learning to fuse measurements from disparate vision sensors, mainly lidar and cameras, into a 2D planar ground-based representation. However, these learning-based methods for creating such maps often rely heavily on extensive annotated data, presenting notable challenges, particularly in diverse or non-urban environments where large-scale datasets are scarce. In this work, we present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal. This method notably reduces the dependence on costly annotated data. By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment. Our pretraining approach demonstrates promising performance in BEV map segmentation tasks, outperforming fully-supervised state-of-the-art methods, while necessitating only a minimal amount of annotated data. This development not only confronts the challenge of data efficiency in BEV representation learning but also broadens the potential for such techniques in a variety of domains, including off-road and indoor environments.
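
The abstract does not describe how pose enters the alignment, so the following is only a rough illustration of the underlying geometry, not the authors' method. Assuming two square BEV grids centred on the sensor and a known planar (SE(2)) pose for each frame, a cell in one grid can be mapped to the corresponding cell in the other. The grid layout, the function names, and the `size` and `resolution` parameters are all assumptions made for this sketch.

```python
import math


def relative_pose(pose_a, pose_b):
    """Pose of frame b expressed in frame a.

    Each pose is (x, y, yaw) in a shared world frame, yaw in radians.
    """
    xa, ya, ta = pose_a
    xb, yb, tb = pose_b
    dx, dy = xb - xa, yb - ya
    c, s = math.cos(ta), math.sin(ta)
    # Rotate the world-frame offset into frame a (inverse rotation of a).
    return (c * dx + s * dy, -s * dx + c * dy, tb - ta)


def transform_point(pose, point):
    """Map a point from the pose's local frame into its parent frame."""
    x, y, t = pose
    px, py = point
    c, s = math.cos(t), math.sin(t)
    return (x + c * px - s * py, y + s * px + c * py)


def cell_to_metric(row, col, size, resolution):
    """Metric centre (x, y) of a cell; the grid is centred on the sensor."""
    half = size / 2.0
    return ((col - half + 0.5) * resolution, (row - half + 0.5) * resolution)


def metric_to_cell(x, y, size, resolution):
    """Cell (row, col) containing the point, or None if it is off the grid."""
    half = size / 2.0
    col = math.floor(x / resolution + half)
    row = math.floor(y / resolution + half)
    if 0 <= row < size and 0 <= col < size:
        return (row, col)
    return None


def align_cell(pose_a, pose_b, row, col, size, resolution):
    """Find the cell in frame a's grid that covers a cell centre of frame b."""
    rel = relative_pose(pose_a, pose_b)
    point_b = cell_to_metric(row, col, size, resolution)
    x_a, y_a = transform_point(rel, point_b)
    return metric_to_cell(x_a, y_a, size, resolution)


if __name__ == "__main__":
    # Frame b is 1 m ahead of frame a along x; 4x4 grid at 1 m per cell.
    print(align_cell((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2, 2, 4, 1.0))  # (2, 3)
```

In this toy setup, cells that fall outside the other grid return `None`, which is where a real pipeline would need masking or padding.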

arXiv information

Authors	Mehdi Hosseinzadeh, Ian Reid
Published	2024-10-28 12:40:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
