WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

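Summary

The Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space and has become a key component of most Latent Video Diffusion Models (LVDMs), where it reduces model training costs.
However, as the resolution and duration of generated videos increase, the encoding cost of the video VAE becomes a limiting bottleneck in training LVDMs.
In addition, the block-wise inference method adopted by most LVDMs can cause discontinuities in the latent space when processing long-duration videos.
The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information.
The wavelet transform can decompose videos into multiple frequency-domain components and significantly improve efficiency. The authors therefore propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages a multi-level wavelet transform to facilitate the flow of low-frequency energy into the latent representation.
They also introduce a method called Causal Cache, which maintains the integrity of the latent space during block-wise inference.
Compared with state-of-the-art video VAEs, WF-VAE shows superior performance on both the PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality.
The code and models are available at https://github.com/PKU-YuanGroup/WF-VAE.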

Abstract (original)

Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities of latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Wavelet transform can decompose videos into multiple frequency-domain components and improve the efficiency significantly, we thus propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality. Our code and models are available at https://github.com/PKU-YuanGroup/WF-VAE.
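To make the idea of low-frequency energy concrete, here is a minimal sketch of a multi-level wavelet decomposition along the time axis. It is not the authors' implementation: the abstract does not say which wavelet WF-VAE uses, so the orthonormal Haar wavelet is an assumption, and `haar_step`, `haar_multilevel`, `energy`, and the sample signal are hypothetical names and values chosen only for illustration.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    s = 1 / math.sqrt(2)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_multilevel(signal, levels):
    """Repeatedly split the low-frequency band.

    Returns the final low band and the list of high bands, finest first.
    """
    highs = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_step(low)
        highs.append(high)
    return low, highs


def energy(values):
    """Sum of squares, the usual notion of signal energy."""
    return sum(v * v for v in values)


# Hypothetical smooth signal: one pixel's intensity over 8 frames.
pixel_over_time = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5]

low, highs = haar_multilevel(pixel_over_time, levels=2)
total = energy(pixel_over_time)
low_share = energy(low) / total  # fraction of energy kept in the low band
```

For a slowly varying signal like this one, almost all of the energy ends up in the two low-frequency coefficients, which is the intuition behind routing the low-frequency band toward the latent representation. Because the transform is orthonormal, the energies of all bands add up to the energy of the input.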

arXiv information

Authors	Zongjian Li,Bin Lin,Yang Ye,Liuhan Chen,Xinhua Cheng,Shenghai Yuan,Li Yuan
Published	2025-04-11 12:31:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
