REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

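Abstract

We present a new perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of the input video, an effective embedder should focus on synthesizing visually plausible reconstructions.
This relaxed criterion allows substantial improvements in compression ratio without compromising the quality of downstream generative models.
Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that uses a diffusion transformer (DiT) to synthesize missing details from a compact latent space.
Within this framework, we develop a dedicated latent conditioning module that conditions the DiT decoder on the encoded video latent embedding.
Our experiments show that this approach achieves better encoding-decoding performance than state-of-the-art methods, particularly as the compression ratio increases.
To demonstrate its efficacy, we report results from video embedders that reach a temporal compression ratio of up to 32x (8x higher than leading video embedders), and we validate the robustness of this ultra-compact latent space for text-to-video generation, which yields a significant efficiency gain in latent diffusion model training and inference.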

Abstract (original)

We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
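
To make the compression figures concrete, here is a minimal Python sketch of the arithmetic. The abstract's "up to 32x (8x higher than leading video embedders)" implies a baseline temporal ratio of 32 / 8 = 4x. The function name `latent_frames`, the 128-frame clip length, and the plain ceiling division are illustrative assumptions, not details from the paper; real embedders may treat the first frame or padding differently.

```python
import math


def latent_frames(num_frames: int, temporal_ratio: int) -> int:
    """Latent frames along the time axis, assuming plain ceiling division.

    Hypothetical helper: ignores any special first-frame or padding rules.
    """
    return math.ceil(num_frames / temporal_ratio)


clip_frames = 128            # illustrative clip length, not from the paper
baseline_ratio = 32 // 8     # "8x higher than leading embedders" implies 4x

for ratio in (baseline_ratio, 32):
    print(f"{ratio}x temporal compression: "
          f"{latent_frames(clip_frames, ratio)} latent frames")
```

Under these assumptions, the same clip occupies one eighth as many latent frames at 32x as at 4x, which is where the efficiency gain for a downstream latent diffusion model would come from.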

arXiv information

Authors	Yitian Zhang,Long Mai,Aniruddha Mahapatra,David Bourgin,Yicong Hong,Jonah Casebeer,Feng Liu,Yun Fu
Published	2025-03-11 17:51:07+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
