The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Abstract

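Recent advances in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models.
However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling.
In this work, we propose LanDiff, a hybrid framework that combines the strengths of both paradigms through coarse-to-fine generation.
Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio;
(2) a language model that generates semantic tokens with high-level semantic relationships;
(3) a streaming diffusion model that refines coarse semantics into high-fidelity videos.
Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model Hunyuan Video (13B) and commercial models such as Sora, Keling, and Hailuo.
Furthermore, our model achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field.
The demo can be viewed at https://landiff.github.io/.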
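
The coarse-to-fine flow described in the abstract (a language model emits a 1D sequence of discrete semantic tokens, then a streaming diffusion model turns those tokens into frames) can be pictured with a toy sketch. Everything below is hypothetical and is not LanDiff's code: the function names, the hash-based "language model", the chunk size, and the constant-valued "frames" are stand-ins chosen only so the example runs without any dependencies. In particular, treating "streaming" as chunk-by-chunk refinement is an assumption about what the abstract means.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Video:
    # Toy stand-in: each frame is a flat list of pixel values in [0, 1).
    frames: List[List[float]]


def generate_semantic_tokens(prompt: str, num_tokens: int,
                             vocab_size: int = 1024) -> List[int]:
    """Coarse stage (hypothetical stand-in for the language model).

    Deterministically maps a prompt to a 1D sequence of discrete token ids.
    """
    tokens = []
    for i in range(num_tokens):
        digest = hashlib.sha256(f"{prompt}:{i}".encode()).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % vocab_size)
    return tokens


def refine_chunk(tokens: List[int], frames_per_token: int,
                 pixels_per_frame: int, vocab_size: int) -> List[List[float]]:
    """Fine stage (hypothetical stand-in for the diffusion model).

    Expands each semantic token into several frames. A real model would
    denoise conditioned on the tokens; here each frame is a constant image.
    """
    frames = []
    for token in tokens:
        value = token / vocab_size
        for _ in range(frames_per_token):
            frames.append([value] * pixels_per_frame)
    return frames


def generate_video(prompt: str, num_tokens: int = 8, chunk_size: int = 4,
                   frames_per_token: int = 2, pixels_per_frame: int = 16,
                   vocab_size: int = 1024) -> Video:
    """Coarse-to-fine pipeline: tokens first, then chunk-wise refinement."""
    tokens = generate_semantic_tokens(prompt, num_tokens, vocab_size)
    frames: List[List[float]] = []
    # Assumed reading of "streaming": refine one chunk of tokens at a time,
    # so long videos never need the whole token sequence in one pass.
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        frames.extend(
            refine_chunk(chunk, frames_per_token, pixels_per_frame, vocab_size)
        )
    return Video(frames)


if __name__ == "__main__":
    video = generate_video("a cat playing piano")
    print(len(video.frames), "frames of", len(video.frames[0]), "pixels")
```

The point of the split is that the token sequence is far shorter than the frame data it conditions, so the sequential, causal part of generation operates on a compact representation while the frame-level detail is filled in afterwards.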

Abstract (original)

Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Keling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.

arXiv information

Authors	Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
Published	2025-03-06 16:53:14+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
