OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

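Summary

The rise of multimodal large language models (MLLMs) has accelerated their application to autonomous driving.
Recent MLLM-based methods produce actions by learning a direct mapping from perception to action, ignoring both the dynamics of the world and the relationship between actions and those dynamics.
Humans, by contrast, possess a world model that lets them simulate future states from an internal 3D visual representation and plan their actions accordingly.
To this end, the authors propose OccLLaMA, an occupancy-language-action generative world model that uses semantic occupancy as a general visual representation and unifies the vision-language-action (VLA) modalities through an autoregressive model.
Specifically, they introduce a novel VQVAE-like scene tokenizer that efficiently discretizes and reconstructs semantic occupancy scenes while accounting for their sparsity and class imbalance.
They then build a unified multimodal vocabulary covering vision, language, and action.
Furthermore, they enhance an LLM, specifically LLaMA, to perform next token/scene prediction over the unified vocabulary and thereby complete multiple autonomous driving tasks.
Extensive experiments show that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, which indicates its potential as a foundation model for autonomous driving.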
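
The unified vocabulary described above can be pictured as one contiguous token-id space shared by text tokens, scene codes, and action tokens, so that a single autoregressive model can predict any of them. The sketch below is only an illustration under stated assumptions: the class name `UnifiedVocab`, the block ordering (text, then scene, then action), and the example sizes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedVocab:
    """Hypothetical layout: [text ids | scene code ids | action ids]."""
    n_text: int    # size of the language model's text vocabulary (assumed)
    n_scene: int   # number of scene codebook entries (assumed)
    n_action: int  # number of discretized action bins (assumed)

    @property
    def size(self) -> int:
        return self.n_text + self.n_scene + self.n_action

    def scene_id(self, code: int) -> int:
        """Map a scene codebook index to its id in the shared space."""
        if not 0 <= code < self.n_scene:
            raise ValueError("scene code out of range")
        return self.n_text + code

    def action_id(self, bin_index: int) -> int:
        """Map an action bin index to its id in the shared space."""
        if not 0 <= bin_index < self.n_action:
            raise ValueError("action bin out of range")
        return self.n_text + self.n_scene + bin_index

    def modality(self, token_id: int) -> str:
        """Report which block a shared-space token id belongs to."""
        if not 0 <= token_id < self.size:
            raise ValueError("token id out of range")
        if token_id < self.n_text:
            return "text"
        if token_id < self.n_text + self.n_scene:
            return "scene"
        return "action"


# Example sizes are placeholders chosen for illustration only.
vocab = UnifiedVocab(n_text=32000, n_scene=2048, n_action=256)
```

With such a layout, the embedding table and output head of the language model would need `vocab.size` rows, and a predicted id can be routed back to the right decoder by calling `vocab.modality`.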

Summary (original)

The rise of multi-modal large language models(MLLMs) has spurred their applications in autonomous driving. Recent MLLM-based methods perform action by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess world model that enables them to simulate the future states based on 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action(VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering its sparsity and classes imbalance. Then, we build a unified multi-modal vocabulary for vision, language and action. Furthermore, we enhance LLM, specifically LLaMA, to perform the next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.
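
The abstract calls the scene tokenizer "VQVAE-like". As a rough illustration of what discretizing with a codebook means in a generic VQ-VAE, the sketch below shows only the nearest-neighbour lookup step in plain Python. It is not the paper's tokenizer: the function names, the toy codebook, and the 2-D feature vectors are assumptions, and the sparsity and class-imbalance handling mentioned in the abstract is not modelled here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each feature vector by the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def reconstruct(indices: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Look the discrete tokens back up to obtain quantized feature vectors."""
    return [codebook[i] for i in indices]


# Toy values for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]

tokens = quantize(features, codebook)      # discrete scene tokens
approx = reconstruct(tokens, codebook)     # quantized features for a decoder
```

In a full VQ-VAE, an encoder would produce `features` from the input scene and a decoder would map `approx` back to a reconstruction; the integer `tokens` are what a language model can consume.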

arXiv information

Authors	Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, Wenchao Ding
Published	2024-09-05 06:30:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
