Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

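Summary

Despite advances in Text-to-Video (T2V) generation, creating videos with realistic motion remains difficult.
Current models often produce static or minimally dynamic output and fail to capture the complex motion described in the text.
This problem stems from an internal bias in text encoding that overlooks motion, and from inadequate conditioning mechanisms in T2V generation models.
To address this, we propose a new framework called DEcomposed MOtion (DEMO), which strengthens motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components.
Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, together with separate content and motion conditioning mechanisms.
Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion.
Evaluations on benchmarks including MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to generate videos with enhanced motion dynamics while maintaining high visual quality.
Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from text descriptions.
Project page: https://PR-Ryan.github.io/DEMO-project/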

Abstract (original)

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding, which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model’s understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO’s superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/

arXiv information

Authors	Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi
Published	2024-10-31 17:59:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
