FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

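Summary

DiT diffusion models have achieved great success in text-to-video generation by exploiting their scalability in model capacity and data scale.
However, achieving high content and motion fidelity aligned with the text prompt often requires a large number of model parameters and a substantial number of function evaluations (NFEs).
Realistic, visually appealing details are typically reflected in high-resolution outputs, which further amplifies the computational demands, especially for single-stage DiT models.
To address these challenges, we propose FlashVideo, a novel two-stage framework that strategically allocates model capacity and NFEs across the stages to balance generation fidelity and quality.
In the first stage, prompt fidelity is prioritized: generation uses large parameters and sufficient NFEs, but runs at low resolution to keep computation efficient.
The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs.
Quantitative and visual results show that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency.
In addition, the two-stage design lets users preview the initial output before committing to full-resolution generation, significantly reducing computational cost and wait time and improving commercial viability.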

Summary (original)

DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
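
The abstract states only that the second stage uses flow matching between low and high resolutions with few NFEs. As a rough illustration of why a direct low-to-high path can need few function evaluations, the toy sketch below assumes a straight-line (linear interpolation) path and plain Euler integration. These choices, the function names, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x_low: Vector, x_high: Vector, t: float) -> Vector:
    """Point on the assumed straight path from x_low (t=0) to x_high (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_low, x_high)]


def target_velocity(x_low: Vector, x_high: Vector) -> Vector:
    """Regression target for a linear path: d/dt x_t = x_high - x_low."""
    return [b - a for a, b in zip(x_low, x_high)]


def euler_sample(
    x_low: Vector,
    velocity_fn: Callable[[Vector, float], Vector],
    nfe: int,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 using `nfe` Euler steps.

    Each step calls velocity_fn once, so `nfe` is the number of function
    evaluations.
    """
    x = list(x_low)
    dt = 1.0 / nfe
    for step in range(nfe):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy data: stand-ins for an upsampled low-resolution latent
# and its high-resolution counterpart.
x_low = [0.0, 0.5, 1.0]
x_high = [0.2, 0.4, 1.3]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(x_low, x_high)


result = euler_sample(x_low, oracle_velocity, nfe=4)
print(result)
```

With the oracle velocity the field is constant along the path, so a handful of Euler steps already lands on the high-resolution target up to floating-point error. A trained model only approximates this field, so the real number of steps needed depends on how well it was learned.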

arXiv information

Authors	Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
Published	2025-02-07 18:59:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
