DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

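Abstract

Large Multimodal Models (LMMs) have demonstrated strong comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models.
Despite this progress, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting overall capability and the ability to generalize.
To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad range of AD tasks, including perception, prediction, and planning.
First, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks.
Next, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving.
To assess general capability and generalization ability, we evaluate on six public benchmarks and perform zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks.
We hope DriveMM will be a promising solution for future end-to-end autonomous driving applications in the real world.
Project page with code: https://github.com/zhijian11/DriveMM.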

Abstract (original)

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.

arXiv information

Authors	Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma
Published	2024-12-12 02:47:24+00:00

Source and services used

arxiv.jp, Google
