VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

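Summary

Modern text-to-video synthesis models generate coherent, photorealistic videos of complex scenes from text descriptions.
However, most existing models lack fine-grained control over camera motion, which is important for downstream applications in content creation, visual effects, and 3D vision.
Recent methods have demonstrated video generation with controllable camera poses. These techniques build on pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation.
Yet no existing approach enables camera control for newer transformer-based video diffusion models, which process spatial and temporal information jointly.
This work proposes to tame video transformers for 3D camera control with a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates.
After fine-tuning on the RealEstate10K dataset, the approach demonstrates state-of-the-art performance for controllable video generation.
To the authors' knowledge, this is the first work to enable camera control for transformer-based video diffusion models.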

Abstract (original)

Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods demonstrate the ability to generate videos with controllable camera poses. These techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.
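
The abstract names Plücker coordinates as the basis of the camera embedding but does not spell out the computation. As a rough illustration only, the sketch below computes a per-pixel Plücker ray embedding for one frame using one common convention: each pixel's ray is encoded as (o x d, d), where o is the camera center and d the unit ray direction in world space. The function name `plucker_embedding`, the pinhole intrinsics `K`, the camera-to-world matrix `c2w`, and the half-pixel offset are assumptions made for this example, not details confirmed by the source.

```python
import numpy as np

def plucker_embedding(K, c2w, height, width):
    """Return an (height, width, 6) array of per-pixel Plücker ray coordinates.

    K:   (3, 3) pinhole intrinsics (assumed convention).
    c2w: (4, 4) camera-to-world matrix (assumed convention).
    """
    # Pixel centers in homogeneous image coordinates, each of shape (H, W).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All rays of one frame share the camera center as their origin.
    origin = c2w[:3, 3]
    moment = np.cross(origin, dirs_world)  # o x d, broadcast over pixels

    return np.concatenate([moment, dirs_world], axis=-1)

# Hypothetical usage: stack one embedding per frame to obtain a
# spatiotemporal tensor of shape (frames, height, width, 6).
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
poses = [np.eye(4) for _ in range(4)]
embedding = np.stack([plucker_embedding(K, pose, 64, 64) for pose in poses])
```

Stacking the per-frame arrays gives a tensor that varies over both space and time, which is the kind of input a conditioning branch could consume. How the paper's model actually processes such an embedding is not described in this summary.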

arXiv information

Authors	Sherwin Bahmani,Ivan Skorokhodov,Aliaksandr Siarohin,Willi Menapace,Guocheng Qian,Michael Vasilkovsky,Hsin-Ying Lee,Chaoyang Wang,Jiaxu Zou,Andrea Tagliasacchi,David B. Lindell,Sergey Tulyakov
Published	2024-07-17 17:59:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
