Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Summary

Title: Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Summary:

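– Latent-Shift is an efficient text-to-video (T2V) generation method built on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model.
– Learning a video diffusion model in the latent space is much more efficient than learning it in the pixel space.
– Pixel-space approaches are often limited to first generating a low-resolution video and then running a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive.
– Prior work extends the U-Net from image generation to video generation by adding modules such as 1D temporal convolution and/or temporal attention layers. Latent-Shift instead proposes a parameter-free temporal shift module that uses the spatial U-Net as is.
– The module shifts two portions of the feature map channels forward and backward along the temporal dimension, so the shifted features of the current frame receive features from the previous and subsequent frames. This enables motion learning without additional parameters.
– Latent-Shift achieves comparable or better results while being significantly more efficient, and it can still generate images despite being finetuned for T2V generation.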

Abstract (original)

We propose Latent-Shift — an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can generate images despite being finetuned for T2V generation.
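The temporal shift described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `temporal_shift`, the `shift_fraction` of 0.25, and the use of a constant `pad_value` at the first and last frame are all assumptions, since the abstract does not specify how many channels are shifted or how clip boundaries are handled. Each frame is represented as a flat list of channel values to keep the example self-contained.

```python
def temporal_shift(frames, shift_fraction=0.25, pad_value=0):
    """Shift two groups of channels along the temporal axis.

    frames: list of T frames, each a list of C channel values.
    shift_fraction: share of channels in EACH shifted group (assumed value).
    pad_value: filler where no neighbouring frame exists (assumed behaviour).
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = int(num_channels * shift_fraction)

    shifted = []
    for t in range(num_frames):
        out = list(frames[t])  # channels outside both groups are left unchanged
        # First group: shifted forward in time, so frame t receives frame t-1.
        for c in range(fold):
            out[c] = frames[t - 1][c] if t > 0 else pad_value
        # Second group: shifted backward in time, so frame t receives frame t+1.
        for c in range(fold, 2 * fold):
            out[c] = frames[t + 1][c] if t < num_frames - 1 else pad_value
        shifted.append(out)
    return shifted


# Three frames with four channels each; with shift_fraction=0.25 one channel
# moves forward, one moves backward, and two stay in place.
frames = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
shifted = temporal_shift(frames)
for row in shifted:
    print(row)
# [0, 21, 12, 13]
# [10, 31, 22, 23]
# [20, 0, 32, 33]
```

The operation only moves existing values between frames, so it introduces no learnable parameters; any layer that follows it sees a mix of features from the previous, current, and next frame.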

arXiv information

Authors	Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin
Published	2023-04-17 17:57:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
