Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

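Summary

To replicate the success of text-to-image (T2I) generation, recent work on text-to-video (T2V) generation fine-tunes on large-scale text-video datasets.
However, this paradigm is computationally expensive.
Humans have a remarkable ability to learn a new visual concept from a single exemplar.
We study a new T2V generation problem, One-Shot Video Generation, in which only a single text-video pair is provided to train an open-domain T2V generator.
Intuitively, we propose adapting a T2I diffusion model pretrained on massive image data to T2V generation.
We make two key observations: 1) T2I models can generate images that align well with verb terms;
2) extending T2I models to generate multiple images concurrently yields surprisingly good content consistency.
To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts through efficient one-shot tuning of a pretrained T2I diffusion model.
Tune-A-Video produces temporally coherent videos across applications such as changing the subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of the method.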

Abstract (original)

To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video dataset for fine-tuning. However, such paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem, One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally-coherent videos over various applications such as change of subject or background, attribute editing, style transfer, demonstrating the versatility and effectiveness of our method.
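
The abstract names Sparse-Causal Attention but does not spell out its pattern. As a rough illustration only, the sketch below assumes that the tokens of each frame act as queries, while keys and values come from the first frame and the immediately preceding frame. The function name `sparse_causal_attention`, the identity Q/K/V projections, and the plain-list data layout are all hypothetical simplifications, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_causal_attention(frames):
    """Attend from each frame to the first frame and the previous frame.

    frames: list of frames; each frame is a list of token vectors
            (lists of floats of equal dimension).
    Assumption: queries, keys, and values are the raw token vectors
    (identity projections), which a real model would replace with
    learned linear layers.
    """
    outputs = []
    for i, frame in enumerate(frames):
        # Frame 0 has no predecessor, so it falls back to itself.
        previous = frames[max(i - 1, 0)]
        # Keys/values: tokens of the first frame plus the previous frame.
        context = frames[0] + previous
        dim = len(frame[0])
        attended = []
        for query in frame:
            weights = softmax([dot(query, key) / math.sqrt(dim) for key in context])
            attended.append(
                [sum(w * value[j] for w, value in zip(weights, context)) for j in range(dim)]
            )
        outputs.append(attended)
    return outputs
```

Under this assumption, the cost per frame depends on two frames of context rather than on the full clip length, which is one plausible reason such a pattern would suit one-shot tuning on a single video.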

arXiv information

Authors	Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
Published	2022-12-22 09:43:36+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
