Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Summary

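Generating multi-view videos for training autonomous driving systems has recently attracted considerable attention; the central challenge is maintaining consistency both across views and across frames. Existing methods typically apply decoupled attention mechanisms to the spatial, temporal, and view dimensions. These approaches often struggle to maintain consistency across dimensions, particularly for fast-moving objects that appear at different times and viewpoints. This paper presents CogDriving, a novel network designed to synthesize high-quality multi-view driving videos. CogDriving uses a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. The authors also propose Micro-Controller, a lightweight controller tailored to CogDriving that uses only 1.1% of the parameters of the standard ControlNet while enabling precise control over Bird's-Eye-View layouts. To improve the generation of object instances that are crucial for autonomous driving, they propose a re-weighted learning objective that dynamically adjusts the learning weights of object instances during training. On the nuScenes validation set, CogDriving achieves an FVD score of 37.8, indicating that it can generate realistic driving videos. The project page is at https://luhannan.github.io/CogDrivingPage/.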
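To make the contrast between decoupled and holistic attention concrete, here is a minimal NumPy sketch. It is not the CogDriving implementation: the function names, the tensor layout `(V, T, S, C)` (views, frames, spatial tokens, channels), and the use of identity query/key/value projections are all assumptions made for illustration. It only shows the structural difference: decoupled attention mixes tokens along one axis at a time, while a holistic variant flattens all three axes into one sequence so that every token can attend to every other token directly.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last (token) axis.
    # Leading axes are treated as independent batches.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def decoupled_attention(x):
    # Hypothetical baseline: attend along one axis at a time.
    # x has shape (V, T, S, C); learned projections are omitted (assumption).
    x = attention(x, x, x)        # spatial: tokens within one view and frame
    x = np.swapaxes(x, 1, 2)      # (V, S, T, C)
    x = attention(x, x, x)        # temporal: same view and location, across frames
    x = np.swapaxes(x, 1, 2)      # back to (V, T, S, C)
    x = np.moveaxis(x, 0, 2)      # (T, S, V, C)
    x = attention(x, x, x)        # view: same frame and location, across views
    return np.moveaxis(x, 2, 0)   # back to (V, T, S, C)


def holistic_attention(x):
    # Hypothetical holistic variant: flatten views, frames, and spatial
    # tokens into one sequence so all V*T*S tokens attend to each other.
    V, T, S, C = x.shape
    tokens = x.reshape(V * T * S, C)
    return attention(tokens, tokens, tokens).reshape(V, T, S, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy sizes chosen for illustration only.
    x = rng.standard_normal((6, 4, 8, 16))
    print(decoupled_attention(x).shape)
    print(holistic_attention(x).shape)
```

In the decoupled version, a token in one view and frame can only reach a token in a different view, frame, and location indirectly, through a chain of three separate attention steps. The flattened version connects them in a single step, at the cost of a score matrix that grows quadratically with `V * T * S`.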

Abstract (original)

Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird’s-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.

arXiv information

Authors	Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao
Published	2024-12-04 18:02:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
