Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

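Summary

Contemporary image generation models show remarkable quality and versatility.
Swayed by these advantages, the research community repurposes them to generate videos.
Because video content is highly redundant, we argue that naively carrying the advances of image models over to video generation reduces motion fidelity and visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
To do so, we first extend the EDM framework to account for spatially and temporally redundant pixels and to support video generation naturally.
Second, we show that the U-Net, the workhorse of image generation, scales poorly when generating videos and requires significant computational overhead.
We therefore propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is about 4.5 times faster at inference).
This allows us, for the first time, to efficiently train a text-to-video model with billions of parameters, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
User studies showed that our model was favored by a large margin over the most recent methods.
See our website at https://snap-research.github.io/snapvideo/.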

Summary (original)

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net – a workhorse behind image generation – scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.

arXiv information

Authors	Willi Menapace,Aliaksandr Siarohin,Ivan Skorokhodov,Ekaterina Deyneka,Tsai-Shien Chen,Anil Kag,Yuwei Fang,Aleksei Stoliar,Elisa Ricci,Jian Ren,Sergey Tulyakov
Published	2024-02-22 18:55:08+00:00

Source, services used

arxiv.jp, Google
