LOVECon: Text-driven Training-Free Long Video Editing with ControlNet

Abstract

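Leveraging pre-trained conditional diffusion models for video editing without further tuning is attracting growing attention because of its promise in film production, advertising, and similar fields. However, pioneering work in this area falls short in generation length, temporal coherence, or fidelity to the source video.
This paper aims to bridge that gap by establishing a simple and effective baseline for training-free, diffusion-model-based long video editing.
As prior work suggests, we build the pipeline on ControlNet, which excels at a variety of image editing tasks driven by text prompts.
To overcome the length constraint imposed by limited compute memory, we split the long video into consecutive windows and develop a novel cross-window attention mechanism that keeps the global style consistent and maximizes smoothness between windows.
For more accurate control, we extract information from the source video via DDIM inversion and integrate the results into the latent states of the generations.
We also incorporate a video frame interpolation model to mitigate frame-level flickering.
Extensive empirical studies verify that our method outperforms competing baselines across scenarios, including replacing the attributes of foreground objects, style transfer, and background replacement.
In addition, our method can edit videos comprising hundreds of frames according to user requirements.
Our project is open source, and the project page is at https://github.com/zhijie-group/LOVECon.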

Abstract (original)

Leveraging pre-trained conditional diffusion models for video editing without further tuning has gained increasing attention due to its promise in film production, advertising, etc. Yet, seminal works in this line fall short in generation length, temporal coherence, or fidelity to the source video. This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing. As suggested by prior arts, we build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts. To break down the length constraints caused by limited computational memory, we split the long video into consecutive windows and develop a novel cross-window attention mechanism to ensure the consistency of global style and maximize the smoothness among windows. To achieve more accurate control, we extract the information from the source video via DDIM inversion and integrate the outcomes into the latent states of the generations. We also incorporate a video frame interpolation model to mitigate the frame-level flickering issue. Extensive empirical studies verify the superior efficacy of our method over competing baselines across scenarios, including the replacement of the attributes of foreground objects, style transfer, and background replacement. Besides, our method manages to edit videos comprising hundreds of frames according to user requirements. Our project is open-sourced and the project page is at https://github.com/zhijie-group/LOVECon.
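
The window-splitting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names Window and split_into_windows are hypothetical, and the choice of reference frames (the first frame of the video for global style, the last frame of the previous window for smoothness) is an assumption made here for illustration only. The abstract states just that windows are consecutive and linked by cross-window attention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    frames: List[int]      # indices of the frames edited together in this window
    references: List[int]  # extra frames this window's attention may look at


def split_into_windows(num_frames: int, window_size: int) -> List[Window]:
    """Split frame indices 0..num_frames-1 into consecutive, non-overlapping windows."""
    if num_frames <= 0 or window_size <= 0:
        raise ValueError("num_frames and window_size must be positive")

    windows: List[Window] = []
    for start in range(0, num_frames, window_size):
        end = min(start + window_size, num_frames)
        frames = list(range(start, end))
        if start == 0:
            # The first window has no earlier frames to refer to.
            references: List[int] = []
        else:
            # Assumption (not from the abstract): frame 0 anchors the global
            # style, and the last frame of the previous window anchors the
            # transition between windows.
            references = sorted({0, start - 1})
        windows.append(Window(frames=frames, references=references))
    return windows


if __name__ == "__main__":
    for w in split_into_windows(num_frames=10, window_size=4):
        print(w.frames, "attends additionally to", w.references)
```

In a real pipeline, each window would be denoised as one batch that fits in memory, with the keys and values of the reference frames concatenated into the attention layers. That part depends on the model code and is left out of the sketch.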

arXiv information

Authors	Zhenyi Liao, Zhijie Deng
Published	2024-05-28 07:04:03+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
