Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

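Summary

Recent advances have established Diffusion Transformers (DiTs) as the dominant framework in generative modeling.
Building on this success, Lumina-Next uses Next-DiT to achieve exceptional performance in photorealistic image generation.
However, its potential for video generation remains largely untapped, and modeling the spatiotemporal complexity inherent in video data poses significant challenges.
To address this, the authors introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while adding solutions tailored to video synthesis.
Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to improve both efficiency and flexibility.
By incorporating the motion score as an explicit condition, Lumina-Video also allows direct control over the dynamic degree of generated videos.
Combined with a progressive training scheme that uses increasingly higher resolution and FPS, and a multi-source training scheme that mixes natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness with high training and inference efficiency.
The authors additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sound for generated videos.
Code is released at https://www.github.com/Alpha-VLLM/Lumina-Video.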
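
The abstract does not say which patch sizes the Multi-scale Next-DiT uses, but the efficiency argument behind learning several patchifications can be illustrated with simple token arithmetic. The sketch below assumes a hypothetical latent shape and hypothetical patch sizes (neither is taken from the paper) and counts how many tokens each patchification would produce. It is not the authors' implementation.

```python
from math import ceil


def token_count(frames, height, width, patch):
    """Count tokens when a (frames, height, width) latent is cut into
    non-overlapping patches of size patch = (pt, ph, pw)."""
    pt, ph, pw = patch
    # Ceiling division models padding a ragged edge up to a full patch.
    return ceil(frames / pt) * ceil(height / ph) * ceil(width / pw)


# Hypothetical values chosen only for illustration.
latent = (16, 64, 64)
patch_sizes = [(1, 2, 2), (2, 2, 2), (2, 4, 4)]

counts = {p: token_count(*latent, p) for p in patch_sizes}
for p, n in counts.items():
    print(f"patch {p}: {n} tokens")
```

Larger patches give fewer tokens, which lowers compute per forward pass at the cost of a coarser representation. A model that handles several patch sizes could, in principle, trade speed against detail at inference time.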

Summary (original)

Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos’ dynamic degree. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Codes are released at https://www.github.com/Alpha-VLLM/Lumina-Video.

arXiv information

Authors	Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao
Published	2025-02-10 18:58:11+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
