Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

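Abstract

Video tokenizers, which convert videos into compact latent representations, are key to video generation.
Existing video tokenizers are based on the VAE architecture and follow a paradigm in which an encoder compresses videos into compact latents and a deterministic decoder reconstructs the original videos from these latents.
In this paper, we propose a novel \underline{\textbf{C}}onditioned \underline{\textbf{D}}iffusion-based video \underline{\textbf{T}}okenizer, entitled \textbf{\ourmethod}, which departs from previous methods by replacing the deterministic decoder with a 3D causal diffusion model.
The reverse diffusion generative process of the decoder is conditioned on the latent representations derived via the encoder.
With feature caching and sampling acceleration, the framework efficiently reconstructs high-fidelity videos of arbitrary length.
Results show that {\ourmethod} achieves state-of-the-art performance on video reconstruction tasks using only single-step sampling.
Even a smaller version of {\ourmethod} achieves reconstruction results on par with the top two baselines.
Furthermore, a latent video generation model trained using {\ourmethod} also shows superior performance.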
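
The data flow described in the abstract (an encoder produces latents, and a diffusion decoder conditioned on those latents reconstructs the video in a single sampling step) can be illustrated with a toy sketch. This is not the authors' implementation: `encode`, `toy_denoiser`, and `decode_single_step` are hypothetical names, the encoder is a plain temporal average, and the "denoiser" is a hand-written stand-in for a trained network. No 3D causal architecture, feature caching, or training is modeled.

```python
import random


def encode(video, stride=4):
    """Toy encoder (assumption): average each group of `stride` frames
    into one latent frame, compressing the video along time."""
    latents = []
    for start in range(0, len(video), stride):
        chunk = video[start:start + stride]
        width = len(chunk[0])
        latents.append(
            [sum(frame[i] for frame in chunk) / len(chunk) for i in range(width)]
        )
    return latents


def toy_denoiser(x_t, alpha_bar, latents, stride=4):
    """Hypothetical stand-in for a trained conditional denoiser.

    It predicts the clean video from the noisy input `x_t`, the signal
    level `alpha_bar` (1.0 = clean, 0.0 = pure noise), and the latents.
    The less signal the input carries, the more it relies on the latents.
    """
    prediction = []
    for t, frame in enumerate(x_t):
        cond = latents[t // stride]  # latent frame covering time step t
        prediction.append(
            [alpha_bar * x + (1.0 - alpha_bar) * c for x, c in zip(frame, cond)]
        )
    return prediction


def decode_single_step(latents, num_frames, stride=4, seed=0):
    """Single-step sampling: start from Gaussian noise and call the
    conditional denoiser once at the highest noise level."""
    rng = random.Random(seed)
    width = len(latents[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_frames)]
    return toy_denoiser(noise, alpha_bar=0.0, latents=latents, stride=stride)


# Usage: an 8-frame "video" with 3 values per frame.
video = [[float(t)] * 3 for t in range(8)]
latents = encode(video)
reconstruction = decode_single_step(latents, num_frames=len(video))
print(len(latents), len(reconstruction))
```

In this toy, the reconstruction at the highest noise level depends only on the latents, which is the sense in which the decoder is "conditioned" on the encoder output; a real diffusion decoder would learn that mapping rather than hard-code it.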

Abstract (original)

Video tokenizers, which transform videos into compact latent representations, are key to video generation. Existing video tokenizers are based on the VAE architecture and follow a paradigm where an encoder compresses videos into compact latents, and a deterministic decoder reconstructs the original videos from these latents. In this paper, we propose a novel \underline{\textbf{C}}onditioned \underline{\textbf{D}}iffusion-based video \underline{\textbf{T}}okenizer entitled \textbf{\ourmethod}, which departs from previous methods by replacing the deterministic decoder with a 3D causal diffusion model. The reverse diffusion generative process of the decoder is conditioned on the latent representations derived via the encoder. With a feature caching and sampling acceleration, the framework efficiently reconstructs high-fidelity videos of arbitrary lengths. Results show that {\ourmethod} achieves state-of-the-art performance in video reconstruction tasks using just a single-step sampling. Even a smaller version of {\ourmethod} still achieves reconstruction results on par with the top two baselines. Furthermore, the latent video generation model trained using {\ourmethod} also shows superior performance.

arXiv information

Authors	Nianzu Yang, Pandeng Li, Liming Zhao, Yang Li, Chen-Wei Xie, Yehui Tang, Xudong Lu, Zhihang Liu, Yun Zheng, Yu Liu, Junchi Yan
Published	2025-03-05 17:59:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
