Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

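Summary

Sora has revealed the immense potential of the Diffusion Transformer (DiT) architecture for single-scene video generation.
However, the more challenging task of multi-scene video generation, which serves a broader range of applications, remains relatively underexplored.
To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations.
Specifically, we introduce a symmetric binary mask at each attention layer of the DiT architecture, so that each text annotation applies only to its own video segment while temporal coherence across visual tokens is preserved.
This attention mechanism enables precise segment-level text-to-visual alignment, allowing the DiT architecture to handle video generation with a fixed number of scenes effectively.
To further equip the DiT architecture with the ability to generate additional scenes from existing ones, we incorporate a segment-level conditional mask that conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension.
Both qualitative and quantitative experiments confirm that Mask$^2$DiT maintains visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
The project page is https://tianhao-qi.github.io/Mask2DiTProject.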
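
The summary does not spell out how the symmetric binary mask is laid out, so the following is only a minimal sketch under stated assumptions. It assumes the attention sequence is the concatenation of all text tokens followed by all visual tokens, that visual tokens may attend to every visual token, and that any pair involving a text token is allowed only within the same segment. The function name `build_symmetric_mask` and its parameters are hypothetical and are not taken from the paper.

```python
def build_symmetric_mask(text_lens, video_lens):
    """Return an N x N boolean matrix; True means attention is allowed.

    Assumed layout (not from the paper): text tokens of segments 0..n-1
    come first, followed by the visual tokens of segments 0..n-1.
    """
    if len(text_lens) != len(video_lens):
        raise ValueError("need exactly one text annotation per video segment")

    # Label every token with (kind, segment index).
    labels = []
    for seg, n in enumerate(text_lens):
        labels += [("text", seg)] * n
    for seg, n in enumerate(video_lens):
        labels += [("video", seg)] * n

    def allowed(a, b):
        (kind_a, seg_a), (kind_b, seg_b) = a, b
        if kind_a == "video" and kind_b == "video":
            # Visual tokens see all visual tokens (temporal coherence).
            return True
        # Any pair involving a text token must stay inside one segment.
        return seg_a == seg_b

    return [[allowed(a, b) for b in labels] for a in labels]


# Two scenes, each with 2 text tokens and 3 visual tokens.
mask = build_symmetric_mask(text_lens=[2, 2], video_lens=[3, 3])
```

In this toy layout, indices 0-1 and 2-3 are the text tokens of scenes 0 and 1, and indices 4-6 and 7-9 are their visual tokens. The rule treats both arguments identically, so the resulting matrix is symmetric by construction.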

Abstract (original)

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
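
The abstract describes the segment-level conditional mask only at a high level. As a rough illustration, the sketch below assumes the mask is a per-token flag that separates preceding (conditioning) segments from the segment being generated, and that conditioning tokens are kept as clean latents while only the new segment is noised. Both assumptions, and the helper names `conditional_mask` and `mix_latents`, are hypothetical and are not confirmed by the source.

```python
def conditional_mask(video_lens, num_context):
    """Per-token flags: False = conditioning token, True = token to generate.

    video_lens  -- number of visual tokens in each segment
    num_context -- how many leading segments are given as context
    """
    if not 0 <= num_context < len(video_lens):
        raise ValueError("num_context must leave at least one segment to generate")
    flags = []
    for seg, n in enumerate(video_lens):
        flags += [seg >= num_context] * n
    return flags


def mix_latents(clean, noisy, flags):
    """Keep clean latents for context tokens, noisy latents elsewhere (assumed scheme)."""
    return [n if f else c for c, n, f in zip(clean, noisy, flags)]


# Three scenes of 2 tokens each; the first two scenes act as context.
flags = conditional_mask(video_lens=[2, 2, 2], num_context=2)
mixed = mix_latents(
    clean=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    noisy=[-1.0] * 6,
    flags=flags,
)
```

Under these assumptions, auto-regressive extension would amount to repeating this step: once a segment is generated it joins the context, and `num_context` grows by one for the next scene.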

arXiv information

Authors	Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang
Published	2025-03-25 17:46:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
