Tracking Meets Large Multimodal Models for Driving Scenario Understanding

Abstract

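Large Multimodal Models (LMMs) have recently become prominent in autonomous driving research, showing promising capabilities across a range of emerging benchmarks.
LMMs designed specifically for this domain have demonstrated effective perception, planning, and prediction skills.
However, many of these methods rely mainly on image data and make insufficient use of 3D spatial and temporal elements.
As a result, their effectiveness in dynamic driving environments is limited.
The authors propose integrating tracking information as an additional input to recover 3D spatial and temporal details that images do not capture effectively.
They introduce a novel approach for embedding this tracking information into LMMs to improve their spatiotemporal understanding of driving scenarios.
Incorporating 3D tracking data through a track encoder enriches the visual queries with important spatial and temporal cues while avoiding the computational overhead of processing long video sequences or extensive 3D inputs.
In addition, a self-supervised approach is used to pretrain the tracking encoder so that it provides LMMs with additional contextual information, significantly improving performance on perception, planning, and prediction tasks for autonomous driving.
Experimental results show the effectiveness of the approach: on the DriveLM-nuScenes benchmark, accuracy improves by 9.5%, the ChatGPT score by 7.04 points, and the overall score by 9.4% over baseline models, and the final score on DriveLM-CARLA improves by 3.7%.
The code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM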

Abstract (original)

Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and 9.4% increase in the overall score over baseline models on DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM

arXiv information

Authors	Ayesha Ishaq, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer
Published	2025-03-18 17:59:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
