VidTwin: Video VAE with Decoupled Structure and Dynamics

Summary

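Recent advances in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation.
This paper proposes VidTwin, a novel and compact video autoencoder that decouples a video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements.
Specifically, the approach uses an Encoder-Decoder backbone augmented with two submodules, one for extracting each latent space.
The first submodule uses a Q-Former to extract low-frequency motion trends, followed by downsampling blocks that remove redundant content details.
The second submodule averages the latent vectors along the spatial dimension to capture rapid motion.
Extensive experiments show that VidTwin achieves a high compression rate of 0.20% while maintaining high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and that it performs efficiently and effectively in downstream generative tasks.
The model also demonstrates explainability and scalability, paving the way for future research on video latent representation and generation.
The code is released at https://github.com/microsoft/VidTok/tree/main/vidtwin.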
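
The two branches described in the summary can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the VidTwin implementation: the latent layout `(T, H, W, C)`, the function names, and the `spatial_stride` parameter are all hypothetical, and the Q-Former is replaced by a plain softmax-weighted pooling over time, which only mimics how a set of queries might summarize slow motion trends.

```python
import numpy as np

def dynamics_latent(z):
    """Average a latent of shape (T, H, W, C) over the spatial axes.

    Returns one C-dimensional vector per frame, shape (T, C).
    """
    return z.mean(axis=(1, 2))

def structure_latent(z, query_logits, spatial_stride=2):
    """Toy stand-in for the Structure branch (NOT a real Q-Former).

    query_logits: hypothetical (Q, T) scores; each row is turned into
    softmax weights over time, so Q "queries" each pool the T frames.
    The pooled maps are then average-pooled spatially (downsampling).
    """
    # Numerically stable softmax over the time axis.
    w = np.exp(query_logits - query_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)            # (Q, T)
    # Weighted temporal pooling: (Q, T) x (T, H, W, C) -> (Q, H, W, C).
    pooled = np.tensordot(w, z, axes=(1, 0))
    # Crop so H and W divide evenly, then average-pool with the stride.
    q, h, wd, c = pooled.shape
    s = spatial_stride
    pooled = pooled[:, : h - h % s, : wd - wd % s]
    return pooled.reshape(q, h // s, s, wd // s, s, c).mean(axis=(2, 4))

# Hypothetical shapes: 4 frames, 4x4 spatial grid, 2 channels, 2 queries.
z = np.arange(4 * 4 * 4 * 2, dtype=np.float64).reshape(4, 4, 4, 2)
dyn = dynamics_latent(z)                       # shape (4, 2)
struct = structure_latent(z, np.zeros((2, 4)))  # shape (2, 2, 2, 2)
```

In this sketch the Dynamics output keeps the full temporal resolution but discards spatial layout, while the Structure output keeps a coarse spatial layout but compresses time, which matches the division of labor the summary describes.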

Summary (original)

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
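
For readers unfamiliar with the two reported numbers, the sketch below shows how such metrics are commonly computed. PSNR follows its standard definition, 10·log10(MAX² / MSE). The compression-rate function assumes the rate is the ratio of latent size to input size; the abstract does not spell out its definition, so treat that as an assumption. The function names and the example sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs: PSNR is unbounded
    return 10.0 * np.log10(max_value ** 2 / mse)

def compression_rate(latent_elements, input_elements):
    """Assumed definition: latent size divided by input size."""
    return latent_elements / input_elements

# Hypothetical sizes, chosen only to illustrate the arithmetic.
input_elements = 16 * 224 * 224 * 3   # frames x height x width x channels
latent_elements = 4817                # made-up latent size
rate = compression_rate(latent_elements, input_elements)
print(f"compression rate: {rate:.2%}")
```

Under this assumed definition, a smaller percentage means a more compact latent, which is why the abstract calls 0.20% a high compression rate.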

arXiv information

Authors	Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
Published	2024-12-23 17:16:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
