Make-A-Video: Text-to-Video Generation without Text-Video Data

Abstract

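We propose Make-A-Video, an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage.
Make-A-Video has three advantages: (1) it accelerates training of the T2V model, because visual and multimodal representations do not have to be learned from scratch; (2) it does not require paired text-video data; and (3) the generated videos inherit the vastness (diversity in aesthetics, fantastical depictions, and so on) of today's image generation models.
We design a simple yet effective way to build on T2I models with novel spatial-temporal modules.
First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time.
Second, we design a spatial-temporal pipeline that generates high-resolution, high-frame-rate videos using a video decoder, an interpolation model, and two super-resolution models, which also enable various applications besides T2V.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.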
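
To make the space-time decomposition more concrete, the following is a minimal sketch of why factorizing might be cheaper than a joint spatiotemporal layer. It assumes that the convolution is split into a 2D spatial convolution followed by a 1D temporal convolution, and that attention is split into per-frame spatial attention followed by per-location temporal attention. The abstract does not specify these details, and the function names and sizes below are hypothetical.

```python
def conv_params(channels: int, k: int = 3) -> dict:
    """Weight counts (bias ignored) for a full 3D conv vs. a 2D-then-1D split.

    Assumption: input and output channel counts are equal and the kernel
    has size k along every axis it covers.
    """
    full_3d = channels * channels * k * k * k
    spatial_2d = channels * channels * k * k   # k x k kernel applied per frame
    temporal_1d = channels * channels * k      # length-k kernel along time
    return {"full_3d": full_3d, "factorized": spatial_2d + temporal_1d}


def attention_pairs(frames: int, height: int, width: int) -> dict:
    """Query-key pairs scored by joint vs. factorized space-time attention."""
    tokens_per_frame = height * width
    full = (frames * tokens_per_frame) ** 2
    spatial = frames * tokens_per_frame ** 2    # each frame attends within itself
    temporal = tokens_per_frame * frames ** 2   # each location attends across frames
    return {"full": full, "factorized": spatial + temporal}


if __name__ == "__main__":
    print(conv_params(channels=64))
    print(attention_pairs(frames=16, height=16, width=16))
```

Under these assumed sizes the factorized variants need fewer weights and far fewer attention scores, at the price of only approximating the joint operation. That is the trade-off the abstract refers to when it says the tensors are approximated in space and time.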

Abstract (original)

We propose Make-A-Video — an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
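
The pipeline named in the abstract (video decoder, interpolation model, two super-resolution models) can be pictured by tracing how the shape of a clip might change from stage to stage. This is only a sketch: the stage order, the frame counts, and the scale factors are illustrative assumptions that the abstract does not state, and `VideoShape`, `interpolate_frames`, `super_resolve`, and `run_pipeline` are hypothetical names, not part of any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    frames: int
    height: int
    width: int


def interpolate_frames(shape: VideoShape, inserted: int) -> VideoShape:
    """Insert `inserted` new frames between each pair of neighbouring frames."""
    frames = (shape.frames - 1) * (inserted + 1) + 1
    return VideoShape(frames, shape.height, shape.width)


def super_resolve(shape: VideoShape, scale: int) -> VideoShape:
    """Upscale every frame spatially by an integer factor."""
    return VideoShape(shape.frames, shape.height * scale, shape.width * scale)


def run_pipeline(decoded: VideoShape) -> list[VideoShape]:
    """Trace shapes through the assumed stages: decoder output,
    frame interpolation, then two super-resolution steps."""
    stages = [decoded]
    stages.append(interpolate_frames(stages[-1], inserted=4))  # assumed factor
    stages.append(super_resolve(stages[-1], scale=4))          # assumed factor
    stages.append(super_resolve(stages[-1], scale=3))          # assumed factor
    return stages


if __name__ == "__main__":
    for stage in run_pipeline(VideoShape(frames=16, height=64, width=64)):
        print(stage)
```

Keeping the temporal upsampling and the spatial upsampling as separate stages means each model handles one kind of resolution increase, which is one plausible reading of how the same components could serve applications other than T2V.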

arXiv information

Authors	Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
Published	2022-09-29 13:59:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
