Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Abstract

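Recent years have seen remarkable progress in image-to-video generation.
However, the 3D consistency and camera controllability of the generated frames remain unsolved.
Recent work has attempted to incorporate camera control into the generation process, but the results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene.
To address these limitations, we introduce Cavia, a novel framework for camera-controllable multi-view video generation that can convert an input image into multiple spatiotemporally consistent videos.
Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency.
This flexible design enables joint training on diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos.
To the best of our knowledge, Cavia is the first framework of its kind that lets users precisely specify camera motion while obtaining object motion.
Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in geometric consistency and perceptual quality.
Project page: https://ir1d.github.io/Cavia/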

Abstract (original)

In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To our best knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/

arXiv information

Authors	Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang
Published	2024-10-14 17:46:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
