VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

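Summary

Zero-shot customized video generation has attracted considerable attention because of its broad application potential.
Existing methods assume that a video diffusion model (VDM) alone is insufficient for this task and therefore rely on additional models to extract and inject features of the reference subject.
However, because their feature extraction and injection techniques are suboptimal, these methods often struggle to keep the subject's appearance consistent.
This paper shows that a VDM inherently has the ability to extract and inject subject features.
Departing from earlier heuristic approaches, it introduces a new framework that uses this inherent ability of the VDM to enable high-quality zero-shot customized video generation.
Specifically, for feature extraction, the reference image is fed directly into the VDM and the model's own feature extraction process is used, which yields fine-grained features that are also well aligned with the VDM's pre-trained knowledge.
For feature injection, the authors devise a bidirectional interaction between subject features and generated content through the spatial self-attention within the VDM, which improves subject fidelity while preserving the diversity of the generated videos.
Experiments on customized video generation for both humans and objects validate the effectiveness of the framework.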
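
The injection step described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes one plausible reading of the summary: reference-image tokens are concatenated with each frame's tokens before a single spatial self-attention call, so frame tokens can attend to subject tokens and subject tokens can attend to frame tokens. The function name `joint_spatial_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_spatial_self_attention(frame_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head spatial self-attention over frame and reference tokens.

    Assumed shapes (hypothetical, for illustration only):
      frame_tokens: (F, N, C)  F frames, N spatial tokens per frame
      ref_tokens:   (M, C)     M tokens from the reference image
      w_q, w_k, w_v: (C, D)    projection matrices
    Returns (frame_out, ref_out) with shapes (F, N, D) and (F, M, D).
    """
    num_frames, n, _ = frame_tokens.shape

    # Share the same reference tokens with every frame.
    ref = np.broadcast_to(ref_tokens, (num_frames,) + ref_tokens.shape)

    # Concatenate along the token axis: (F, N + M, C).
    tokens = np.concatenate([frame_tokens, ref], axis=1)

    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v

    # Every token attends to every other token, so information flows
    # frame -> reference and reference -> frame in one attention call.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v

    # Split the result back into frame and reference parts.
    return out[:, :n], out[:, n:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(4, 16, 8))   # 4 frames, 16 tokens, 8 channels
    reference = rng.normal(size=(6, 8))    # 6 reference tokens
    w = [rng.normal(size=(8, 8)) for _ in range(3)]
    frame_out, ref_out = joint_spatial_self_attention(frames, reference, *w)
    print(frame_out.shape, ref_out.shape)
```

In this sketch only the frame part of the output would feed the rest of the denoising network. Whether and how the reference part is used downstream is not stated in the summary, so it is returned separately here without further assumptions.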

Abstract (original)

Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM’s inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM’s pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.

arXiv information

Authors	Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li
Published	2024-12-30 02:50:45+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
