DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

Summary

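Advances in autonomous driving technology call for increasingly sophisticated methods for understanding and predicting real-world scenarios.
Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving.
This paper proposes the DriveGenVLM framework, which generates driving videos and uses VLMs to understand them.
To this end, it employs a video generation framework based on denoising diffusion probabilistic models (DDPM) that aims to predict real-world video sequences.
It then uses a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV) to examine whether the generated videos are adequate for use with VLMs.
The diffusion model is trained on the Waymo open dataset and evaluated with the Fréchet Video Distance (FVD) score to ensure the quality and realism of the generated videos.
EILEV provides corresponding narrations for these generated videos, which may be beneficial in the autonomous driving domain.
These narrations can enhance traffic scene understanding, aid navigation, and improve planning capabilities.
Integrating video generation with VLMs in the DriveGenVLM framework is a significant step toward using advanced AI models to address complex challenges in autonomous driving.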
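
As a rough illustration of what "based on DDPM" means, the sketch below shows the standard DDPM forward (noising) step applied to a small stack of frames. It is only a minimal sketch under stated assumptions: the linear beta schedule, the number of steps, the helper names `make_alpha_bar` and `q_sample`, and the random stand-in clip are all hypothetical and are not taken from the paper, whose actual schedule and network are not described in this summary.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical stand-in for a video clip: (frames, height, width, channels).
clip = rng.uniform(-1.0, 1.0, size=(4, 16, 16, 3))

x_early, _ = q_sample(clip, 10, alpha_bar, rng)   # lightly noised
x_late, _ = q_sample(clip, 999, alpha_bar, rng)   # close to pure noise
```

A denoising network would be trained to predict `eps` from `x_t` and `t`; generation then runs the learned reverse process starting from noise. That network is outside the scope of this sketch.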

Abstract (original)

The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained with the Waymo open dataset and evaluated using the Fréchet Video Distance (FVD) score to ensure the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.
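
The FVD score mentioned in the abstract is a Fréchet distance between Gaussians fitted to feature embeddings of real and generated videos. The sketch below computes only that distance on plain arrays, as an illustration. The function name `frechet_distance` and the random arrays standing in for embeddings are assumptions; a real FVD evaluation would first extract features with a pretrained video network, which is not shown here and is not specified in this abstract.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is an (n_samples, n_features) array with n_features >= 2.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for video embeddings (500 clips, 8 features each).
real = rng.normal(0.0, 1.0, size=(500, 8))
close = rng.normal(0.0, 1.0, size=(500, 8))   # same distribution as `real`
far = rng.normal(2.0, 1.0, size=(500, 8))     # shifted mean

d_same = frechet_distance(real, real)
d_close = frechet_distance(real, close)
d_far = frechet_distance(real, far)
```

Lower values indicate that the generated features are statistically closer to the real ones, so `d_far` comes out much larger than `d_close` in this toy setup.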

arXiv information

Authors	Yongjie Fu, Anmol Jain, Xuan Di, Xu Chen, Zhaobin Mo
Published	2024-08-29 15:52:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
