Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

Abstract

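Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily because of the computational demands of the attention mechanism.
For example, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, of which around 500 PFLOPs are consumed by attention computation.
To address this issue, we propose AdaSpa, the first sparse attention method that combines a Dynamic Pattern with Online Precise Search.
First, to realize the Dynamic Pattern, we introduce a blockified pattern that efficiently captures the hierarchical sparsity inherent in DiTs.
This is based on our observation that the sparse characteristics of DiTs exhibit hierarchical, blockified structures both between and within different modalities.
The blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos.
Second, to enable Online Precise Search, we propose Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention.
This method is motivated by our finding that the sparse pattern and the LSE of DiTs vary with respect to inputs, layers, and heads, but remain invariant across denoising steps.
By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows precise, real-time identification of sparse indices with minimal overhead.
AdaSpa is implemented as an adaptive, plug-and-play solution that integrates seamlessly with existing DiTs and requires neither additional fine-tuning nor dataset-dependent profiling.
Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing it as a robust and scalable approach to efficient video generation.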
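
As a rough illustration of the two ideas above (block-level sparsity and the use of the log-sum-exp, LSE), the following NumPy sketch may help. It is not AdaSpa's fused kernel, and the abstract does not describe the actual search procedure: the function names `masked_attention` and `top_blocks`, the top-k selection rule, and the toy sizes are all assumptions made for illustration. The one fact it relies on is that exp(LSE_sparse - LSE_full) equals the fraction of softmax mass that the kept key blocks retain for each query row.

```python
import numpy as np

def masked_attention(q, k, v, block_mask, block_size):
    """Single-head attention restricted to the (query block, key block) tiles
    flagged in block_mask. Reference semantics only: scores are computed
    densely and then masked, so this does not save any compute."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Expand the block-level mask to a token-level mask.
    token_mask = np.repeat(np.repeat(block_mask, block_size, axis=0),
                           block_size, axis=1)
    masked = np.where(token_mask, scores, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    weights = np.exp(masked - row_max)
    row_sum = weights.sum(axis=1, keepdims=True)
    out = (weights / row_sum) @ v
    # Per-row log-sum-exp of the kept scores.
    lse = (row_max + np.log(row_sum))[:, 0]
    return out, lse

def top_blocks(q, k, block_size, keep):
    """Hypothetical selection rule (an assumption, not the paper's search):
    for each query block, keep the `keep` key blocks with the most softmax mass."""
    n, d = q.shape
    nb = n // block_size
    scores = (q @ k.T) / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Sum the probability mass inside each (query block, key block) tile.
    tile_mass = probs.reshape(nb, block_size, nb, block_size).sum(axis=(1, 3))
    order = np.argsort(-tile_mass, axis=1)[:, :keep]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, order, True, axis=1)
    return mask

rng = np.random.default_rng(0)
n, d, block_size = 64, 16, 8  # toy sizes, chosen arbitrarily
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

full_mask = np.ones((n // block_size, n // block_size), dtype=bool)
dense_out, dense_lse = masked_attention(q, k, v, full_mask, block_size)

sparse_mask = top_blocks(q, k, block_size, keep=3)
sparse_out, sparse_lse = masked_attention(q, k, v, sparse_mask, block_size)

# Fraction of each row's softmax mass that the kept blocks retain.
retained = np.exp(sparse_lse - dense_lse)
print("mean retained attention mass:", retained.mean())
```

If, as the abstract reports, the sparse pattern and the LSE stay stable across denoising steps, a mask and a full-attention LSE computed at one step could plausibly be reused at later steps, with the retained mass serving as a cheap quality check. That reading is an inference from the abstract rather than a description of the authors' implementation.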

Abstract (original)

Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. Firstly, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Secondly, to enable Online Precise Search, we propose the Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that DiTs’ sparse pattern and LSE vary w.r.t. inputs, layers, and heads, but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows for precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs, requiring neither additional fine-tuning nor a dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing itself as a robust and scalable approach to efficient video generation.

arXiv information

Authors	Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui
Published	2025-02-28 14:11:20+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
