VideoPanda: Video Panoramic Diffusion with Multi-view Attention

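Summary

High-resolution panoramic video content is paramount for immersive experiences in virtual reality, but it is non-trivial to collect because it requires specialized equipment and intricate camera setups.
In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data.
VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content.
VideoPanda is trained jointly with two conditions, text-only and single-view video, and supports autoregressive generation of long videos.
To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model gracefully generalizes to generating more frames at inference time.
Extensive evaluations on both real-world and synthetic video datasets show that, across all input conditions, VideoPanda generates more realistic and coherent 360$^\circ$ panoramas than existing methods.
For results, visit the project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/.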

Summary (original)

High resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions: text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360$^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/ for results.
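The abstract says that training randomly subsamples both the duration and the camera views to keep multi-view video generation affordable. The Python sketch below draws one such training sample. It is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the abstract does not confirm. Views are indexed 0..V-1 around the panorama. "Duration" is treated as a contiguous window of frames. The helper name `subsample_views_and_frames` and the `max_views`/`max_frames` budgets are hypothetical.

```python
import random
from typing import List, Tuple


def subsample_views_and_frames(
    num_views: int,
    num_frames: int,
    max_views: int,
    max_frames: int,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: choose a random subset of camera views and a
    random contiguous clip of frames for one training example.

    Assumptions (not from the paper): views are unordered indices around the
    panorama, and 'duration' subsampling means a contiguous temporal window.
    """
    # Randomly choose how many views to keep (at least one, at most the budget).
    k_views = rng.randint(1, min(max_views, num_views))
    # Draw that many distinct view indices; sort them for a stable ordering.
    views = sorted(rng.sample(range(num_views), k_views))

    # Randomly choose a clip length, then a valid start offset for it.
    clip_len = rng.randint(1, min(max_frames, num_frames))
    start = rng.randint(0, num_frames - clip_len)
    frames = list(range(start, start + clip_len))
    return views, frames


if __name__ == "__main__":
    rng = random.Random(0)
    # Example: 8 views of a 64-frame video, trained with at most 4 views x 16 frames.
    views, frames = subsample_views_and_frames(8, 64, 4, 16, rng)
    print("views:", views)
    print("frames:", frames[0], "to", frames[-1])
```

Under these assumptions, each training step pays attention cost only for the sampled views and frames. According to the abstract, the model still generalizes to producing more frames at inference time.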

arXiv information

Authors	Kevin Xie, Amirmojtaba Sabour, Jiahui Huang, Despoina Paschalidou, Greg Klar, Umar Iqbal, Sanja Fidler, Xiaohui Zeng
Published	2025-04-15 16:58:15+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
