Distilling Multi-modal Large Language Models for Autonomous Driving

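Summary

Autonomous driving requires safe motion planning, particularly in critical "long-tail" scenarios.
Recent end-to-end autonomous driving systems use large language models (LLMs) as planners to improve generalization to rare events.
However, running an LLM at test time is computationally expensive.
To address this, we propose DiMA, an end-to-end autonomous driving system that retains the efficiency of an LLM-free (vision-based) planner while drawing on the world knowledge of an LLM.
DiMA distills information from a multi-modal LLM into a vision-based end-to-end planner through a set of specially designed surrogate tasks.
Under a joint training strategy, a scene encoder shared by both networks produces structured representations that are semantically grounded and aligned with the final planning objective.
Notably, the LLM is optional at inference time, which enables robust planning without sacrificing efficiency.
Training with DiMA reduces the vision-based planner's L2 trajectory error by 37% and its collision rate by 80%, and reduces trajectory error in long-tail scenarios by 44%.
DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.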
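
The structure described above (a shared scene encoder, an efficient vision-based planner, and an LLM branch that is optional at inference) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DiMA's implementation: the summary does not specify the surrogate tasks, so a generic output-matching distillation term stands in for them, and every name here (`DrivingSystem`, `training_loss`, `distill_weight`, the toy lambdas) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Features = List[float]
Trajectory = List[float]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


@dataclass
class DrivingSystem:
    # Hypothetical components; the real networks are not described in the summary.
    scene_encoder: Callable[[List[float]], Features]   # shared by both branches
    vision_planner: Callable[[Features], Trajectory]   # efficient, LLM-free student
    llm_planner: Optional[Callable[[Features], Trajectory]] = None  # teacher branch

    def plan(self, sensor_input: List[float], use_llm: bool = False) -> Trajectory:
        """Inference: the LLM branch is skipped unless explicitly requested."""
        features = self.scene_encoder(sensor_input)
        if use_llm and self.llm_planner is not None:
            return self.llm_planner(features)
        return self.vision_planner(features)

    def training_loss(self, sensor_input: List[float], target: Trajectory,
                      distill_weight: float = 1.0) -> float:
        """Joint objective: planning loss plus an assumed distillation term."""
        features = self.scene_encoder(sensor_input)
        student = self.vision_planner(features)
        planning_loss = mse(student, target)
        if self.llm_planner is None:
            return planning_loss
        teacher = self.llm_planner(features)
        # Assumption: plain output matching, standing in for the surrogate tasks.
        return planning_loss + distill_weight * mse(student, teacher)


# Toy stand-ins so the sketch runs end to end.
system = DrivingSystem(
    scene_encoder=lambda x: [2.0 * v for v in x],
    vision_planner=lambda f: [v + 1.0 for v in f],
    llm_planner=lambda f: [v + 2.0 for v in f],
)
print(system.plan([1.0, 2.0]))                        # vision-only inference path
print(system.training_loss([1.0, 2.0], [3.0, 5.0]))   # joint training objective
```

Because both branches read the same encoder output, gradients from the distillation term would also shape the shared representation in a real training setup, which is the property the summary attributes to joint training.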

Summary (original)

Autonomous driving demands safe motion planning, especially in critical ‘long-tail’ scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in longtail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
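
To make the reported percentages concrete, the sketch below computes an L2 trajectory error and the relative reduction between two planners. It is illustrative only: the abstract does not state the exact evaluation protocol (for example, how errors are averaged over time horizons), so the mean per-waypoint Euclidean distance used here is an assumption, and the waypoints are made-up values.

```python
import math


def l2_trajectory_error(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to its baseline value."""
    return (baseline - improved) / baseline


# Made-up waypoints in metres, for illustration only.
ground_truth = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
baseline_plan = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
distilled_plan = [(0.5, 1.0), (0.5, 2.0), (0.5, 3.0)]

baseline_error = l2_trajectory_error(baseline_plan, ground_truth)
distilled_error = l2_trajectory_error(distilled_plan, ground_truth)
print(relative_reduction(baseline_error, distilled_error))
```

Under this definition, a "37% reduction" means the distilled planner's error is 0.63 times the baseline planner's error on the same evaluation set.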

arXiv information

Authors	Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M. Patel, Fatih Porikli
Published	2025-01-16 18:59:53+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
