Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Summary

Title: Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Summary:
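– Latent-Shift is an efficient text-to-video generation method built on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model.
– Training a video diffusion model in the latent space is much more efficient than in the pixel space. Pixel-space approaches are often limited to first generating a low-resolution video and then applying a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive.
– To extend a U-Net from image generation to video generation, prior work adds extra modules such as 1D temporal convolution and/or temporal attention layers. In contrast, the authors propose a parameter-free temporal shift module that uses the spatial U-Net as is for video generation.
– The module shifts two portions of the feature map channels forward and backward along the temporal dimension, so the shifted features of the current frame receive features from the previous and subsequent frames. This enables motion learning without additional parameters. The authors show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can still generate images despite being finetuned for T2V generation.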
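
The abstract describes the temporal shift as moving two portions of the feature map channels forward and backward along the temporal dimension. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function name temporal_shift, the shift_div parameter, the (T, C, H, W) layout, and the zero padding at the first and last frames are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def temporal_shift(x: np.ndarray, shift_div: int = 8) -> np.ndarray:
    """Parameter-free temporal shift over a (T, C, H, W) feature map.

    Assumptions (not stated in the abstract): 1/shift_div of the channels
    is shifted forward in time, another 1/shift_div backward, and the
    positions left empty at the clip boundaries are filled with zeros.
    """
    num_channels = x.shape[1]
    fold = num_channels // shift_div
    out = np.zeros_like(x)
    # Forward shift: frame t receives these channels from frame t-1.
    out[1:, :fold] = x[:-1, :fold]
    # Backward shift: frame t receives these channels from frame t+1.
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
    # The remaining channels stay with their own frame.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


# Toy input: 4 frames, 8 channels, 2x2 latents; frame i is filled with the value i.
frames = np.stack([np.full((8, 2, 2), float(i)) for i in range(4)])
shifted = temporal_shift(frames, shift_div=4)
```

In this toy input, frame 1 ends up with channels 0 and 1 taken from frame 0, channels 2 and 3 taken from frame 2, and channels 4 to 7 unchanged. The function has no learnable weights, which is the sense in which the module is parameter-free: any mixing of the shifted features is left to the existing spatial layers that follow.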

Abstract (original)

We propose Latent-Shift — an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can generate images despite being finetuned for T2V generation.

arXiv information

Authors	Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin
Published	2023-04-18 03:27:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
