NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Abstract

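Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains.
However, their ability to understand driving scenes remains less proven.
The complexity of driving scenarios, which include multi-view information, poses significant challenges for existing MLLMs.
In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding.
To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset containing 1M real-world visual question-answering (VQA) pairs.
For context-aware analysis of traffic scenes, we categorize the dataset into nine subtasks across three core skills: road environment perception, spatial relations recognition, and ego-centric reasoning.
Furthermore, we present BEV-LLM, which integrates bird's-eye-view (BEV) features from multi-view images into MLLMs.
Our evaluation results reveal key challenges that existing MLLMs face in driving-scene-specific perception and in spatial reasoning from ego-centric perspectives.
In contrast, BEV-LLM adapts well to this domain, outperforming other models on six of the nine subtasks.
These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that need further refinement for effective adaptation to driving scenes.
To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/nuplanqa.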

Abstract (original)

Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird’s-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/NuPlanQA.

arXiv information

Authors	Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang
Published	2025-03-17 03:12:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
