CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

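Summary

We introduce CogVideoX, a large-scale diffusion transformer model designed to generate videos from text prompts.
To model video data efficiently, we propose using a 3D variational autoencoder (VAE) to compress videos along both the spatial and temporal dimensions.
To improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm, which facilitates deep fusion between the two modalities.
By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos with significant motion.
In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method.
This pipeline contributes substantially to the performance of CogVideoX, improving both generation quality and semantic alignment.
Results show that CogVideoX achieves state-of-the-art performance on multiple machine metrics as well as in human evaluations.
The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.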

Summary (original)

We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
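
To make the idea of compressing a video along both spatial and temporal dimensions concrete, the following is a minimal sketch of the shape arithmetic involved. The abstract does not state the compression factors, so the values used here (4 along time, 8 along each spatial axis) are illustrative assumptions, and the function names latent_shape and compression_ratio are hypothetical helpers rather than part of the released code. The sketch also ignores channel counts and any special handling of the first frame that a causal VAE might apply.

```python
from math import ceil


def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Return (latent_frames, latent_height, latent_width).

    t_factor and s_factor are ASSUMED compression factors chosen for
    illustration; they are not taken from the abstract.
    """
    if min(frames, height, width, t_factor, s_factor) <= 0:
        raise ValueError("all sizes and factors must be positive")
    # Each axis shrinks by its factor; ceil keeps partial blocks.
    return (
        ceil(frames / t_factor),
        ceil(height / s_factor),
        ceil(width / s_factor),
    )


def compression_ratio(frames, height, width, t_factor=4, s_factor=8):
    """Ratio of input positions to latent positions (channels ignored)."""
    lf, lh, lw = latent_shape(frames, height, width, t_factor, s_factor)
    return (frames * height * width) / (lf * lh * lw)


if __name__ == "__main__":
    # Hypothetical clip: 48 frames at 480x720 resolution.
    print(latent_shape(48, 480, 720))
    print(compression_ratio(48, 480, 720))
```

Under these assumed factors, the number of positions the transformer must attend over falls by the product of the three factors, which is the motivation for compressing in time as well as in space.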

arXiv information

Authors	Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang
Published	2024-08-12 11:47:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google