VidTwin: Video VAE with Decoupled Structure and Dynamics

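Summary

Recent advances in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation.
This paper proposes VidTwin, a novel and compact video autoencoder that decouples a video into two distinct latent spaces: Structure latent vectors, which capture the overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements.
Specifically, the approach uses an Encoder-Decoder backbone augmented with two submodules, one for extracting each of these latent spaces.
The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks that remove redundant content details.
The second submodule averages the latent vectors along the spatial dimension to capture rapid motion.
Extensive experiments show that VidTwin achieves a high compression rate of 0.20% while keeping reconstruction quality high (a PSNR of 28.14 on the MCL-JCV dataset), and that it performs efficiently and effectively in downstream generative tasks.
The model also demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.
For more details, see the project page: https://vidtwin.github.io/.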
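
The spatial averaging attributed to the second submodule can be pictured with a small sketch. This is not the authors' implementation: the function name `spatial_average`, the `[frames][height][width][channels]` layout, and the choice to average over both spatial axes at once are assumptions made only for illustration, and a real model would use a tensor library rather than nested lists.

```python
def spatial_average(features):
    """Average a [frames][height][width][channels] nested list over height and width.

    Returns a [frames][channels] nested list: one vector per frame, so
    frame-to-frame changes are kept while spatial layout is discarded.
    Hypothetical helper, not taken from the VidTwin code base.
    """
    averaged = []
    for frame in features:
        height = len(frame)
        width = len(frame[0])
        channels = len(frame[0][0])
        sums = [0.0] * channels
        for row in frame:
            for pixel in row:
                for k, value in enumerate(pixel):
                    sums[k] += value
        averaged.append([s / (height * width) for s in sums])
    return averaged


# Toy input with made-up sizes: 2 frames, 2x2 spatial grid, 3 channels.
toy = [
    [[[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
     [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]],
    [[[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]],
     [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]],
]
per_frame = spatial_average(toy)
print(len(per_frame), len(per_frame[0]))
```

The output keeps one vector per frame, which is why a representation of this kind can follow rapid temporal changes while staying small.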

Abstract (original)

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Check our project page for more details: https://vidtwin.github.io/.
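
The two headline numbers, the compression rate and the PSNR, can be made concrete with a short sketch. The PSNR formula is the standard one; the compression rate is assumed here to be the ratio of latent elements to input video elements, which the abstract does not spell out. The helper names and the clip size below are hypothetical and are not taken from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def compression_rate(latent_elements, video_elements):
    """Assumed definition: latent size divided by input video size."""
    if video_elements <= 0:
        raise ValueError("video_elements must be positive")
    return latent_elements / video_elements


# Hypothetical clip: 16 frames of 224x224 RGB (sizes chosen only for illustration).
video_elements = 16 * 224 * 224 * 3
# Latent budget implied by a 0.20% rate under the assumed definition.
latent_budget = round(0.0020 * video_elements)
print(video_elements, latent_budget)
print(f"{compression_rate(latent_budget, video_elements):.4%}")
```

Under this assumed definition, a 0.20% rate means the latent representation holds roughly one element for every 500 elements of the input video.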

arXiv information

Authors	Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
Published	2025-03-28 17:32:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
