Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Summary

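End-to-end autonomous driving demonstrates strong planning capabilities when trained on large-scale data, but it still struggles in complex, rare scenarios because its commonsense is limited.
In contrast, Large Vision-Language Models (LVLMs) excel at scene understanding and reasoning.
The path forward lies in merging the strengths of both approaches.
Because LVLMs are not well suited to precise numerical prediction, previous methods that used LVLMs to predict trajectories or control signals produced suboptimal results.
This paper presents Senna, an autonomous driving system that combines an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E).
Senna decouples high-level planning from low-level trajectory prediction.
Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories.
Senna-VLM uses a multi-image encoding approach and multi-view prompts to understand the scene efficiently.
In addition, planning-oriented QAs are introduced alongside a three-stage training strategy, which improves Senna-VLM's planning performance while preserving commonsense.
Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance.
Notably, with pre-training on the large-scale DriveX dataset and fine-tuning on nuScenes, Senna reduces average planning error by 27.12% and collision rate by 33.33% compared with the model without pre-training.
We believe that Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving.
Code and models will be released at https://github.com/hustvl/Senna.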
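
The decoupling described in the summary (a language stage that outputs a planning decision as text, followed by a numerical stage that outputs waypoints) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the keyword rule, the decision strings, and the constant-acceleration rollout are hypothetical stand-ins and are not taken from the Senna code or paper.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x forward, y lateral) in metres, ego frame


def high_level_planner(scene_text: str) -> str:
    """Hypothetical stand-in for the language stage.

    Returns a planning decision as plain text. A real LVLM would reason
    over camera images; this toy rule only inspects a text description.
    """
    if "pedestrian" in scene_text.lower():
        return "decelerate and keep lane"
    return "keep speed and keep lane"


def low_level_planner(decision: str, speed_mps: float,
                      horizon_s: float = 3.0, dt: float = 0.5) -> List[Waypoint]:
    """Hypothetical stand-in for the numerical stage.

    Converts the text decision into waypoints with a constant-acceleration
    rollout. This is not a learned model.
    """
    accel = -2.0 if "decelerate" in decision else 0.0  # m/s^2, assumed value
    steps = int(horizon_s / dt)
    waypoints: List[Waypoint] = []
    x, v = 0.0, speed_mps
    for _ in range(steps):
        v = max(0.0, v + accel * dt)  # speed never goes negative
        x += v * dt
        waypoints.append((x, 0.0))    # lane keeping: no lateral offset
    return waypoints


def drive(scene_text: str, speed_mps: float) -> Tuple[str, List[Waypoint]]:
    """Run the two stages in sequence: text decision first, numbers second."""
    decision = high_level_planner(scene_text)
    return decision, low_level_planner(decision, speed_mps)


if __name__ == "__main__":
    decision, trajectory = drive("A pedestrian is crossing ahead", 10.0)
    print(decision)
    print(trajectory)
```

The point of the split is that the text stage never has to emit coordinates, which is the task the summary says LVLMs handle poorly, while the numerical stage receives the decision as a conditioning input.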

Summary (original)

End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. Besides, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM’s planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on a large-scale dataset DriveX and fine-tuning on nuScenes, Senna significantly reduces average planning error by 27.12% and collision rate by 33.33% over model without pre-training. We believe Senna’s cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.
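
For readers interpreting the reported percentages, the following assumes they are relative reductions measured against the model without pre-training; the abstract does not spell out the formula, and the symbols $m_{\text{base}}$ and $m_{\text{pre}}$ are introduced here only for illustration.

```latex
\[
  \text{relative reduction}
  = \frac{m_{\text{base}} - m_{\text{pre}}}{m_{\text{base}}} \times 100\%
\]
```

Under that assumption, a 33.33% reduction in collision rate means the pre-trained model's value is about two thirds of the baseline's. The abstract gives no absolute values, so none are stated here.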

arXiv information

Authors	Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang
Published	2024-10-29 17:53:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
