Probing Multimodal LLMs as World Models for Driving

Summary

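We take a sober look at the application of multimodal large language models (MLLMs) to autonomous driving and challenge or verify some common assumptions, focusing on their ability to reason about and interpret dynamic driving scenarios from sequences of images/frames in a closed-loop control environment.
Despite significant advances in MLLMs such as GPT-4V, their performance in complex, dynamic driving environments remains largely untested and leaves wide room for exploration.
We conduct a comprehensive experimental study to evaluate the capability of various MLLMs as world models for driving from the perspective of a fixed in-car camera.
Our findings reveal that, while these models interpret individual images proficiently, they struggle significantly to synthesize coherent narratives or logical sequences across frames depicting dynamic behavior.
The experiments show considerable inaccuracies in predicting (i) basic vehicle dynamics (forward/backward, acceleration/deceleration, turning right or left), (ii) interactions with other road actors (e.g., identifying speeding cars or heavy traffic), (iii) trajectory planning, and (iv) open-set dynamic scene reasoning, which suggests biases in the models' training data.
To enable this experimental study, we introduce DriveSim, a specialized simulator designed to generate diverse driving scenarios, providing a platform for evaluating MLLMs in the driving domain.
Additionally, we provide the full open-source code and a new dataset, "Eval-LLM-Drive", for evaluating MLLMs in driving.
Our results highlight a critical gap in the current capabilities of state-of-the-art MLLMs and underscore the need for enhanced foundation models to improve their applicability in real-world dynamic environments.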

Summary (original)

We provide a sober look at the application of Multimodal Large Language Models (MLLMs) within the domain of autonomous driving and challenge/verify some common assumptions, focusing on their ability to reason and interpret dynamic driving scenarios through sequences of images/frames in a closed-loop control environment. Despite the significant advancements in MLLMs like GPT-4V, their performance in complex, dynamic driving environments remains largely untested and presents a wide area of exploration. We conduct a comprehensive experimental study to evaluate the capability of various MLLMs as world models for driving from the perspective of a fixed in-car camera. Our findings reveal that, while these models proficiently interpret individual images, they struggle significantly with synthesizing coherent narratives or logical sequences across frames depicting dynamic behavior. The experiments demonstrate considerable inaccuracies in predicting (i) basic vehicle dynamics (forward/backward, acceleration/deceleration, turning right or left), (ii) interactions with other road actors (e.g., identifying speeding cars or heavy traffic), (iii) trajectory planning, and (iv) open-set dynamic scene reasoning, suggesting biases in the models’ training data. To enable this experimental study we introduce a specialized simulator, DriveSim, designed to generate diverse driving scenarios, providing a platform for evaluating MLLMs in the realms of driving. Additionally, we contribute the full open-source code and a new dataset, ‘Eval-LLM-Drive’, for evaluating MLLMs in driving. Our results highlight a critical gap in the current capabilities of state-of-the-art MLLMs, underscoring the need for enhanced foundation models to improve their applicability in real-world dynamic environments.

arXiv information

Authors	Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus
Published	2024-05-09 17:52:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
