Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

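Abstract

Although diffusion models can generate photorealistic images, generating realistic and diverse videos is still in its infancy.
One key reason is that current methods entangle spatial content with temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation.
This work proposes HiGen, a diffusion-model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives: the structure level and the content level.
At the structure level, HiGen decomposes the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser.
Specifically, it uses text to generate spatially coherent priors during spatial reasoning, and then generates temporally coherent motion from these priors during temporal reasoning.
At the content level, it extracts two subtle cues from the content of the input video, one expressing motion changes and the other appearance changes.
These two cues guide model training for video generation, enabling flexible content variation and improving temporal stability.
Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and stable motion.
Extensive experiments demonstrate that HiGen outperforms state-of-the-art T2V methods.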
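
The structure-level decomposition described above can be pictured with a toy control-flow sketch. This is not HiGen's implementation: the `denoise` function is a hypothetical stand-in that simply pulls a noisy vector toward its condition, and `generate_video`, its parameters, and the vector "latents" are all assumptions made for illustration. It only shows the ordering of the two steps and that one shared denoiser serves both.

```python
import random


def denoise(latent, condition, steps=20, rate=0.3):
    """Toy stand-in for a diffusion denoiser (an assumption, not the paper's model).

    Each step moves the noisy latent a fraction of the way toward its condition.
    """
    for _ in range(steps):
        latent = [x + rate * (c - x) for x, c in zip(latent, condition)]
    return latent


def generate_video(text_embedding, num_frames=4, seed=0):
    """Two-step generation: spatial reasoning first, then temporal reasoning."""
    rng = random.Random(seed)
    dim = len(text_embedding)

    # Step 1 (spatial reasoning): text -> one spatially coherent prior.
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    spatial_prior = denoise(noise, text_embedding)

    # Step 2 (temporal reasoning): the same denoiser, now conditioned on the
    # prior, produces each frame. Frames are denoised independently here for
    # brevity; a real temporal stage would model them jointly.
    frames = []
    for _ in range(num_frames):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append(denoise(noise, spatial_prior))
    return spatial_prior, frames
```

Calling `generate_video([0.5, -1.0, 2.0])` returns a prior close to the text embedding and four frames close to that prior, which mirrors the idea that the second step starts from the spatial prior rather than from text alone.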

Abstract (original)

Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that can express motion and appearance changes, respectively. These two cues then guide the model’s training for generating videos, enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm, HiGen can effectively reduce the complexity of this task and generate realistic videos with semantics accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods.
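
The abstract does not say how the motion and appearance cues are computed. As a purely illustrative sketch, the code below assumes two simple proxies: the mean absolute difference between consecutive frames as a motion cue, and the cosine similarity of each frame's feature vector to the first frame's as an appearance cue. The function names and both definitions are hypothetical and should not be read as the paper's formulas.

```python
import math


def motion_cue(frames):
    """Hypothetical motion proxy: mean absolute difference between consecutive frames.

    `frames` is a list of equal-length lists of pixel (or latent) values.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, nxt)) / len(prev)
        for prev, nxt in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def appearance_cue(features):
    """Hypothetical appearance proxy: similarity of each frame's features to frame 0."""
    return [cosine(features[0], f) for f in features]
```

Under these assumptions, a static clip has a motion cue of 0 and an appearance cue of 1 for every frame, while larger frame-to-frame changes raise the first and lower the second. Scalars like these could then be fed to a model as conditioning signals during training.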

arXiv information

Authors	Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang
Published	2023-12-07 17:59:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
