Training-Free Efficient Video Generation via Dynamic Token Carving

Abstract

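Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements.
This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length, and the multi-step nature of diffusion models.
To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation.
Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention.
Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation.
Experimental results show that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with a 0.01\% performance drop on VBench).
As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds, without requiring model retraining.
Code: https://github.com/dvlab-research/Jenga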
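
The two insights can be made concrete with a small back-of-the-envelope cost model. The sketch below counts only the query-key pairs scored by self-attention, and every number in it (token count, step count, the stage split, the 20% keep fraction) is a hypothetical value chosen to make the arithmetic easy to follow. It is not taken from the paper and does not try to reproduce the measured 8.83$\times$ speedup.

```python
def attention_pairs(num_tokens: int, keep_fraction: float = 1.0) -> float:
    """Query-key pairs scored by one self-attention pass.

    Dense attention scores num_tokens**2 pairs; keep_fraction models a
    sparse variant that scores only that share of them.
    """
    return keep_fraction * num_tokens ** 2


def schedule_cost(stages) -> float:
    """Sum attention pairs over (steps, num_tokens, keep_fraction) stages."""
    return sum(steps * attention_pairs(n, keep) for steps, n, keep in stages)


# Hypothetical numbers, for illustration only.
FULL_TOKENS = 32_000   # tokens in the full-resolution latent (assumed)
TOTAL_STEPS = 50       # denoising steps (assumed)

# Baseline: dense attention at full resolution for every step.
baseline = schedule_cost([(TOTAL_STEPS, FULL_TOKENS, 1.0)])

# Assumed two-stage schedule: the first 25 steps run on a latent with a
# quarter of the tokens; the last 25 run at full resolution but score
# only 20% of the query-key pairs.
staged = schedule_cost([
    (25, FULL_TOKENS // 4, 1.0),
    (25, FULL_TOKENS, 0.2),
])

ratio = baseline / staged
print(f"attention-pair reduction: {ratio:.2f}x")
```

Because the cost is quadratic in token count, quartering the tokens in the early steps cuts their attention cost by a factor of 16, which is why the low-resolution stage contributes so little to the total in this toy model.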

Abstract (original)

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds — without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
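
A minimal sketch of what block-wise selection along a 3D space-filling curve could look like is given below. The abstract does not say which curve or which selection rule Jenga uses, so everything here is an assumption: a Z-order (Morton) curve stands in for the space-filling curve, blocks are scored by the dot product of their mean-pooled queries and keys, and the names `morton_key`, `curve_order`, and `select_blocks` are hypothetical. See the linked repository for the authors' actual implementation.

```python
import numpy as np


def morton_key(t: int, y: int, x: int, bits: int = 4) -> int:
    """Interleave the bits of (t, y, x) into one Z-order index."""
    key = 0
    for b in range(bits):
        key |= ((t >> b) & 1) << (3 * b + 2)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((x >> b) & 1) << (3 * b)
    return key


def curve_order(T: int, H: int, W: int) -> np.ndarray:
    """Permutation that reorders raster-scan tokens along the Z-order curve."""
    keys = [morton_key(t, y, x)
            for t in range(T) for y in range(H) for x in range(W)]
    return np.argsort(np.array(keys), kind="stable")


def select_blocks(q: np.ndarray, k: np.ndarray, block: int, top_k: int) -> np.ndarray:
    """Boolean (num_blocks, num_blocks) mask of key blocks kept per query block."""
    n, d = q.shape
    nb = n // block
    # Assumed scoring rule: compare mean-pooled queries and keys per block.
    q_mean = q.reshape(nb, block, d).mean(axis=1)
    k_mean = k.reshape(nb, block, d).mean(axis=1)
    scores = q_mean @ k_mean.T
    keep = np.argsort(-scores, axis=1)[:, :top_k]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    np.fill_diagonal(mask, True)  # each block always attends to itself
    return mask


# Toy latent: 4 frames of 8x8 tokens with 16 channels (hypothetical sizes).
rng = np.random.default_rng(0)
T, H, W, D = 4, 8, 8, 16
order = curve_order(T, H, W)
q = rng.standard_normal((T * H * W, D))[order]
k = rng.standard_normal((T * H * W, D))[order]

mask = select_blocks(q, k, block=32, top_k=2)
print(mask.shape, f"kept fraction: {mask.mean():.2f}")
```

Reordering tokens along the curve before cutting them into blocks means each block covers a compact spatio-temporal neighbourhood (here a 2x4x4 cuboid) instead of a thin raster-scan strip, so a block-level keep-or-drop decision is more likely to group tokens that behave alike.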

arXiv information

Authors	Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Published	2025-05-22 16:21:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
