FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

Abstract

Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models have achieved high-quality results over the last few years, whereas video synthesis methods have only recently begun to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model. The first stage synthesizes keyframes to establish the storyline of a video, while the second generates interpolation frames to make the movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of a MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/
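Two of the reconstruction metrics named in the abstract, MSE and PSNR, have simple closed forms, and a minimal sketch may help show how they relate. This is an illustration only, not the authors' evaluation code: the function names `mse` and `psnr`, the flat-list frame representation, and the 8-bit peak value of 255 are all assumptions made here. SSIM and LPIPS are left out because they need windowed statistics and a learned network, respectively.

```python
import math


def mse(reference, decoded):
    """Mean squared error between two equally sized frames.

    Frames are flat lists of pixel intensities (an assumption for this sketch).
    """
    if len(reference) != len(decoded) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    return sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)


def psnr(reference, decoded, peak=255.0):
    """Peak signal-to-noise ratio in decibels, derived from the MSE.

    `peak` is the maximum possible pixel value (255 assumes 8-bit frames).
    """
    err = mse(reference, decoded)
    if err == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / err)


# Toy 4-pixel "frames" standing in for a reference frame and its decoded version.
reference = [0, 64, 128, 255]
decoded = [0, 66, 126, 255]

print("MSE :", mse(reference, decoded))
print("PSNR:", psnr(reference, decoded))
```

Because PSNR is a decreasing function of MSE, a better decoder shows a lower MSE and a higher PSNR on the same frames. LPIPS, like MSE, is a distance for which lower is better, while SSIM, like PSNR, is higher for better reconstructions.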

arXiv information

Authors	Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
Published	2023-12-20 15:58:26+00:00
