SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Summary

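We introduce Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation.
Compared with its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency.
It achieves this through key improvements in several areas: 1) network architecture: removing the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention; 2) data: improving the quality and quantity of the training data; 3) training strategy: adopting progressive 3D-4D training for better generalization; and 4) 4D optimization: handling 3D inconsistency and large motion through 2-stage refinement and progressive frame sampling.
Extensive experiments show significant performance gains for SV4D 2.0, both visually and quantitatively: compared with SV4D, it achieves better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, and better results in 4D optimization (-12% LPIPS and -24% FV4D).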

Summary (original)

We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency of reference multi-views and designing blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D.

arXiv information

Authors	Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, Varun Jampani
Published	2025-03-21 03:39:27+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
