Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

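Summary

We propose a new approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints.
Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time.
At its core is a cascaded 4D transformer architecture that factorizes attention across the view, time, and spatial dimensions, enabling efficient modeling of the 4D space.
Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers.
To train this model, we curate a multi-dimensional dataset spanning images, videos, multi-view data, and 3D/4D scans, together with a multi-dimensional training strategy.
Our approach overcomes the limitations of previous methods based on GANs or UNet-based diffusion models, which struggle with complex motions and viewpoint changes.
Through extensive experiments, we demonstrate the method's ability to synthesize realistic, coherent, free-view human videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation.
Our project website is https://human4dit.github.io.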
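
The summary states that factorizing attention across views, time, and space makes modeling the 4D space efficient, but it gives no numbers. The sketch below is one way to see why, under assumptions that are not taken from the paper: it counts query-key pairs for joint attention over all tokens and for attention applied along one axis at a time. The function names and the sizes (4 views, 16 frames, 256 spatial tokens) are hypothetical and chosen only for illustration.

```python
def full_attention_pairs(views: int, frames: int, tokens: int) -> int:
    """Query-key pairs if every token attends to every token in the 4D volume."""
    total = views * frames * tokens
    return total * total


def factorized_attention_pairs(views: int, frames: int, tokens: int) -> int:
    """Query-key pairs if attention runs separately along each axis.

    Assumption: each token attends once along the view axis, once along the
    time axis, and once along the spatial axis. This is a generic counting
    argument for axis-wise attention, not the paper's exact layer layout.
    """
    total = views * frames * tokens
    return total * (views + frames + tokens)


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    v, t, n = 4, 16, 256
    full = full_attention_pairs(v, t, n)
    fact = factorized_attention_pairs(v, t, n)
    print(f"full: {full:,}  factorized: {fact:,}  ratio: {full / fact:.1f}x")
```

Under these assumed sizes the joint count grows with the square of the total token count, while the factorized count grows with the total token count times the sum of the axis lengths, which is the usual motivation for axis-wise attention.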

Abstract (original)

We present a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints. Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time. The core is a cascaded 4D transformer architecture that factorizes attention across views, time, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we curate a multi-dimensional dataset spanning images, videos, multi-view data and 3D/4D scans, along with a multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on GAN or UNet-based diffusion models, which struggle with complex motions and viewpoint changes. Through extensive experiments, we demonstrate our method’s ability to synthesize realistic, coherent and free-view human videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation. Our project website is https://human4dit.github.io.

arXiv information

Authors	Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, Yebin Liu
Published	2024-05-27 17:53:29+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
