xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Summary

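We present xGen-VideoSyn-1, a text-to-video (T2V) generation model that produces realistic scenes from text descriptions.
Building on recent advances such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE).
VidVAE compresses video data both spatially and temporally, which substantially shortens the visual token sequence and lowers the computational cost of generating long videos.
To reduce that cost further, we propose a divide-and-merge strategy that maintains temporal consistency across video segments.
Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different durations and aspect ratios.
We devised a data processing pipeline from the very beginning and collected over 13 million high-quality video-text pairs.
The pipeline includes clipping, text detection, motion estimation, aesthetics scoring, and dense captioning with our in-house video-LLM.
Training the VidVAE and DiT models took approximately 40 and 642 H100 days, respectively.
The model supports end-to-end generation of 720p videos longer than 14 seconds and performs competitively against state-of-the-art T2V models.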
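
To make the effect of spatio-temporal compression on token length concrete, the following rough sketch counts transformer tokens for a 14-second 720p clip. The frame rate (24 fps), the compression strides (4x temporal, 8x spatial), and the patch size (2) are illustrative assumptions and are not values stated in this summary.

```python
import math


def latent_token_count(frames, height, width, t_stride=4, s_stride=8, patch=2):
    """Count tokens after VAE compression and spatial patchification.

    All strides and the patch size are assumed, illustrative values.
    """
    latent_t = math.ceil(frames / t_stride)   # temporal compression
    latent_h = math.ceil(height / s_stride)   # spatial compression (height)
    latent_w = math.ceil(width / s_stride)    # spatial compression (width)
    tokens_per_frame = math.ceil(latent_h / patch) * math.ceil(latent_w / patch)
    return latent_t * tokens_per_frame


frames = 14 * 24  # 14 seconds at an assumed 24 fps

# Spatial compression only (t_stride=1) versus spatial plus temporal compression.
spatial_only = latent_token_count(frames, 720, 1280, t_stride=1)
spatio_temporal = latent_token_count(frames, 720, 1280, t_stride=4)

print(spatial_only, spatio_temporal, spatial_only / spatio_temporal)
```

Under these assumptions, temporal compression alone cuts the sequence length by the temporal stride, which matters because self-attention cost grows faster than linearly with sequence length.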

Summary (original)

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI’s Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
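
The abstract does not describe how the divide-and-merge strategy works internally. As a generic illustration only, and not the authors' method, one common way to process a long sequence in pieces while keeping transitions smooth is to split it into overlapping segments and cross-fade the overlaps when merging. The function names, the overlap size, and the linear blending weights below are all hypothetical.

```python
def split_segments(n_frames, seg_len, overlap):
    """Return (start, end) spans covering n_frames, sharing `overlap` frames."""
    if n_frames <= 0 or not 0 <= overlap < seg_len:
        raise ValueError("need n_frames > 0 and 0 <= overlap < seg_len")
    step = seg_len - overlap
    spans, start = [], 0
    while True:
        end = min(start + seg_len, n_frames)
        spans.append((start, end))
        if end == n_frames:
            return spans
        start += step


def merge_segments(segments, spans, n_frames, overlap):
    """Blend per-segment outputs with linear fade-in/fade-out in the overlaps."""
    total = [0.0] * n_frames
    weight = [0.0] * n_frames
    for seg, (start, end) in zip(segments, spans):
        for i, value in enumerate(seg):
            w = 1.0
            if start > 0 and i < overlap:
                # Fade in where this segment overlaps the previous one.
                w = (i + 1) / (overlap + 1)
            if end < n_frames and i >= len(seg) - overlap:
                # Fade out where this segment overlaps the next one.
                w = min(w, (len(seg) - i) / (overlap + 1))
            total[start + i] += w * value
            weight[start + i] += w
    # Normalize so the blending weights at every frame sum to one.
    return [t / w for t, w in zip(total, weight)]


# Each "frame" is a single float here; a real latent would be a tensor.
frames = [float(i) for i in range(10)]
spans = split_segments(len(frames), seg_len=4, overlap=2)
segments = [frames[s:e] for s, e in spans]  # stand-in for per-segment processing
merged = merge_segments(segments, spans, len(frames), overlap=2)
```

With an identity stand-in for the per-segment processing, the merged result reproduces the input, which is a quick sanity check that the blending weights are normalized.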

arXiv information

Authors	Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong
Published	2024-08-22 17:55:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
