VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

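Abstract

This paper introduces VideoGen, a text-to-video generation approach that uses reference-guided latent diffusion to generate high-definition videos with high frame fidelity and strong temporal consistency. An off-the-shelf text-to-image generation model, such as Stable Diffusion, generates an image with high content quality from the text prompt, and this image serves as a reference to guide video generation. Next, an efficient cascaded latent diffusion module, conditioned on both the reference image and the text prompt, generates latent video representations, followed by a flow-based temporal upsampling step that improves the temporal resolution. Finally, an enhanced video decoder maps the latent video representations into a high-definition video. During training, the first frame of a ground-truth video is used as the reference image for training the cascaded latent diffusion module. The main characteristics of the approach are: the reference image generated by the text-to-image model improves visual fidelity; using it as the condition lets the diffusion model focus more on learning video dynamics; and the video decoder is trained on unlabeled video data, so it benefits from high-quality, easily available videos. VideoGen sets a new state of the art in text-to-video generation in both qualitative and quantitative evaluation.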

Abstract (original)

In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality, easily available videos. VideoGen sets a new state of the art in text-to-video generation in terms of both qualitative and quantitative evaluation.
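
The four stages named in the abstract (reference image, latent video generation, temporal upsampling, decoding) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: every function name below is hypothetical, each stage is a trivial stand-in operating on lists of floats, and the temporal upsampling uses plain linear interpolation in place of the flow-based step the abstract describes. It only shows how the stages chain together and how the frame count changes.

```python
from typing import List

# A toy "latent frame": a flat list of numbers (assumption for illustration).
Frame = List[float]


def generate_reference_latent(prompt: str, size: int = 4) -> Frame:
    """Hypothetical stand-in for the text-to-image model: prompt -> reference."""
    seed = sum(ord(c) for c in prompt) % 10
    return [float(seed + i) for i in range(size)]


def latent_video_from_reference(ref: Frame, prompt: str,
                                num_frames: int = 4) -> List[Frame]:
    """Hypothetical stand-in for the cascaded latent diffusion module.

    The first frame equals the reference; later frames drift linearly.
    The prompt is accepted to mirror the conditioning but is unused here.
    """
    return [[v + 0.5 * t for v in ref] for t in range(num_frames)]


def temporal_upsample(frames: List[Frame]) -> List[Frame]:
    """Insert one linearly interpolated frame between each adjacent pair.

    Linear interpolation is a swapped-in simplification, not flow-based.
    """
    if not frames:
        return []
    out: List[Frame] = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append([(x + y) / 2 for x, y in zip(a, b)])
    out.append(frames[-1])
    return out


def decode(frames: List[Frame]) -> List[Frame]:
    """Hypothetical stand-in for the video decoder: repeat each value twice."""
    return [[v for v in frame for _ in range(2)] for frame in frames]


def videogen_sketch(prompt: str) -> List[Frame]:
    """Chain the four stages in the order the abstract lists them."""
    ref = generate_reference_latent(prompt)
    latents = latent_video_from_reference(ref, prompt)
    latents = temporal_upsample(latents)
    return decode(latents)


video = videogen_sketch("a cat on a beach")
print(len(video), len(video[0]))
```

With the default settings, 4 latent frames become 7 after the interpolation step, and the toy decoder doubles each frame's length from 4 to 8 values.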

arXiv information

Authors	Xin Li,Wenqing Chu,Ye Wu,Weihang Yuan,Fanglong Liu,Qi Zhang,Fu Li,Haocheng Feng,Errui Ding,Jingdong Wang
Published	2023-09-01 11:14:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
