Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Summary

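Recent advances in text-to-video (T2V) diffusion models have significantly improved the visual quality of generated videos.
However, even recent T2V models struggle to follow text descriptions accurately, especially when the prompt requires precise control over spatial layouts or object trajectories.
A recent line of work applies layout guidance to T2V models, but these methods require either fine-tuning or iterative manipulation of attention maps at inference time.
This significantly increases memory requirements and makes it difficult to adopt a large T2V model as the backbone.
To address this, we introduce Video-MSG, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization.
Video-MSG consists of three steps. In the first two, it creates a Video Sketch: a fine-grained spatio-temporal plan for the final video that specifies the background, foreground, and object trajectories in the form of draft video frames.
In the last step, Video-MSG guides a downstream T2V diffusion model with the Video Sketch through noise inversion and denoising.
Notably, Video-MSG requires neither fine-tuning nor attention manipulation with additional memory at inference time, which makes it easier to adopt large T2V models.
Video-MSG improves text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench).
We provide comprehensive ablation studies on the noise inversion ratio, different background generators, background object detection, and foreground object segmentation.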
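
To make the idea of a Video Sketch more concrete, here is a minimal, hypothetical sketch of how draft frames could be assembled from a background and an object trajectory. Everything in it is an assumption made for illustration and is not taken from the paper: the `Box` type, the function names, the linear interpolation between two keyframe boxes, and the use of small integer grids in place of real images. In the actual method the plan comes from multimodal planning, and the foreground is a segmented object rather than a filled rectangle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical axis-aligned box in pixel coordinates."""
    x: int  # left column
    y: int  # top row
    w: int  # width in pixels
    h: int  # height in pixels


def interpolate_boxes(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Linearly interpolate a box from `start` to `end` over `num_frames` frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    boxes = []
    for i in range(num_frames):
        # t runs from 0.0 (first frame) to 1.0 (last frame).
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        boxes.append(Box(
            x=round(start.x + t * (end.x - start.x)),
            y=round(start.y + t * (end.y - start.y)),
            w=round(start.w + t * (end.w - start.w)),
            h=round(start.h + t * (end.h - start.h)),
        ))
    return boxes


def compose_sketch(background: List[List[int]], boxes: List[Box],
                   fg_value: int = 1) -> List[List[List[int]]]:
    """Paste a filled foreground rectangle onto a copy of the background per frame."""
    height, width = len(background), len(background[0])
    frames = []
    for box in boxes:
        frame = [row[:] for row in background]  # copy so the background is not mutated
        # Clip the box to the frame so out-of-range trajectories are safe.
        for r in range(max(0, box.y), min(height, box.y + box.h)):
            for c in range(max(0, box.x), min(width, box.x + box.w)):
                frame[r][c] = fg_value
        frames.append(frame)
    return frames


background = [[0] * 8 for _ in range(4)]
trajectory = interpolate_boxes(Box(0, 1, 2, 2), Box(6, 1, 2, 2), num_frames=4)
frames = compose_sketch(background, trajectory)
```

Here a 2x2 object moves from the left edge to the right edge of a 4x8 background over four draft frames. The draft frames only need to fix layout and motion, since the downstream T2V model is responsible for the final appearance.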

Summary (original)

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
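
The abstract does not spell out how noise inversion works, so the following is only a sketch under stated assumptions. It assumes an SDEdit-style forward noising step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, applied to a latent encoding of the Video Sketch, with the noise inversion ratio selecting how far along the diffusion schedule to go. The linear beta schedule, the function names, and the flat list standing in for a video latent are all hypothetical; a real pipeline would use the backbone's own scheduler and latent tensors.

```python
import math
import random
from typing import List, Tuple


def alpha_bar_schedule(num_steps: int, beta_start: float = 1e-4,
                       beta_end: float = 0.02) -> List[float]:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def noise_sketch_latent(latent: List[float], inversion_ratio: float,
                        alpha_bars: List[float],
                        rng: random.Random) -> Tuple[List[float], int]:
    """Noise the sketch latent up to step t = round(inversion_ratio * num_steps).

    Returns the noisy latent and t, the step from which denoising would start.
    """
    if not 0.0 < inversion_ratio <= 1.0:
        raise ValueError("inversion_ratio must be in (0, 1]")
    num_steps = len(alpha_bars)
    t = max(1, round(inversion_ratio * num_steps))
    a = alpha_bars[t - 1]
    signal, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    # Mix the sketch latent with Gaussian noise; more noise for larger t.
    noisy = [signal * x + sigma * rng.gauss(0.0, 1.0) for x in latent]
    return noisy, t


alpha_bars = alpha_bar_schedule(1000)
sketch_latent = [1.0] * 16  # stand-in for an encoded Video Sketch
noisy_latent, start_step = noise_sketch_latent(
    sketch_latent, inversion_ratio=0.5, alpha_bars=alpha_bars, rng=random.Random(0)
)
```

Under these assumptions, the T2V model would then denoise `noisy_latent` from `start_step` down to zero, conditioned on the text prompt. A larger ratio adds more noise and gives the model more freedom to depart from the sketch, while a smaller ratio keeps the output closer to the planned layout; this is the kind of trade-off an ablation on the noise inversion ratio would examine.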

arXiv information

Authors	Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
Published	2025-04-11 15:41:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
