DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

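Abstract

Research on end-to-end autonomous driving has surged because its fully differentiable design integrates modular tasks such as perception, prediction, and planning, which allows optimization toward the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods struggle in several respects, including expensive BEV (bird's-eye view) computation, action diversity, and sub-optimal decisions in complex real-world scenarios. To address these challenges, we propose a novel hybrid sparse-dense diffusion policy, enhanced by a vision-language model (VLM), called Diff-VLA. We explore a sparse diffusion representation for efficient multi-modal driving behavior. Furthermore, we rethink the effectiveness of VLM driving decisions and improve trajectory generation guidance through deep interaction among agents, map instances, and VLM output. Our method shows superior performance in the Autonomous Grand Challenge 2025, which includes challenging real scenarios and reactive synthetic scenarios. Our method achieves 45.0 PDMS.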

Abstract (original)

Research interest in end-to-end autonomous driving has surged owing to its fully differentiable design integrating modular tasks, i.e. perception, prediction and planning, which enables optimization in pursuit of the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods suffer from several aspects including expensive BEV (bird’s eye view) computation, action diversity, and sub-optimal decision in complex real-world scenarios. To address these challenges, we propose a novel hybrid sparse-dense diffusion policy, empowered by a Vision-Language Model (VLM), called Diff-VLA. We explore the sparse diffusion representation for efficient multi-modal driving behavior. Moreover, we rethink the effectiveness of VLM driving decision and improve the trajectory generation guidance through deep interaction across agent, map instances and VLM output. Our method shows superior performance in Autonomous Grand Challenge 2025 which contains challenging real and reactive synthetic scenarios. Our method achieves 45.0 PDMS.

arXiv information

Authors	Anqing Jiang,Yu Gao,Zhigang Sun,Yiru Wang,Jijun Wang,Jinghao Chai,Qian Cao,Yuweng Heng,Hao Jiang,Yunda Dong,Zongzheng Zhang,Xianda Guo,Hao Sun,Hao Zhao
Published	2025-06-03 02:28:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
