S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

Abstract

The latest advancements in multimodal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches, which learn directly from sensor inputs to generate planning trajectories without human annotations, often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.

arXiv information

Authors	Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, Dragomir Anguelov, Mingxing Tan
Published	2025-06-03 17:03:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
