Faster Video Diffusion with Trainable Sparse Attention

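Summary

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass is concentrated on a small subset of positions.
We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention in both training and inference.
In VSA, a lightweight coarse stage pools tokens into tiles and identifies the high-weight critical tokens.
A fine stage then computes token-level attention only inside those tiles, following a block compute layout that keeps the kernel hardware-efficient.
The result is a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU.
We run a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters.
VSA reaches a Pareto point that cuts training FLOPs by 2.53× with no drop in diffusion loss.
Retrofitting the open-source Wan-2.1 model speeds up attention by 6× and lowers end-to-end generation time from 31 s to 18 s with comparable quality.
These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.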
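The coarse-to-fine idea described above can be illustrated with a small sketch. This is not the authors' kernel: it assumes 1D tiles rather than 3D ones, mean pooling for the coarse stage, and top-k tile selection per query tile, none of which the abstract specifies. The names `coarse_to_fine_attention`, `tile`, and `top_k` are hypothetical, and the code uses plain Python lists instead of a fused GPU kernel.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine_attention(q, k, v, tile, top_k):
    """Toy 1D coarse-to-fine sparse attention (assumed design, see lead-in).

    q, k, v: lists of n vectors of width d; n must be a multiple of `tile`.
    Returns (outputs, selected); selected[i] lists the key tiles that
    query tile i attends to.
    """
    n, d = len(q), len(q[0])
    n_tiles = n // tile
    scale = 1.0 / math.sqrt(d)
    tiles = [range(t * tile, (t + 1) * tile) for t in range(n_tiles)]

    # Coarse stage: one pooled query and one pooled key per tile.
    q_pool = [mean_pool([q[i] for i in idx]) for idx in tiles]
    k_pool = [mean_pool([k[i] for i in idx]) for idx in tiles]

    out, selected = [], []
    for qt in range(n_tiles):
        # Score every key tile against this query tile; keep the top_k.
        tile_scores = [dot(q_pool[qt], k_pool[kt]) for kt in range(n_tiles)]
        keep = sorted(range(n_tiles), key=lambda kt: -tile_scores[kt])[:top_k]
        keep.sort()
        selected.append(keep)

        # Fine stage: token-level softmax attention over kept tiles only.
        cols = [j for kt in keep for j in tiles[kt]]
        for i in tiles[qt]:
            w = softmax([scale * dot(q[i], k[j]) for j in cols])
            out.append([sum(wj * v[j][c] for wj, j in zip(w, cols))
                        for c in range(d)])
    return out, selected


def random_vectors(n, d, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
q = random_vectors(16, 4, rng)
k = random_vectors(16, 4, rng)
v = random_vectors(16, 4, rng)
out, selected = coarse_to_fine_attention(q, k, v, tile=4, top_k=2)
print(selected)  # key tiles chosen for each of the 4 query tiles
```

In this toy setting the fine stage scores only `top_k / n_tiles` of all query-key pairs (half of them in the call above), and setting `top_k` equal to the number of tiles recovers dense attention. Because whole tiles are kept or dropped, the surviving work stays in contiguous blocks, which is the property the summary ties to hardware efficiency.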

Abstract (original)

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout that ensures hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.

arXiv information

Authors	Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang
Published	2025-05-19 17:30:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
