Edit Temporal-Consistent Videos with Image Diffusion Model

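Summary

Large-scale text-to-image (T2I) diffusion models have been extended to text-guided video editing, yielding impressive zero-shot video editing performance.
Nonetheless, because the temporal characteristics of video are not faithfully modeled, the generated videos usually show spatial irregularities and temporal inconsistencies.
This paper proposes Temporal-Consistent Video Editing (TCVE), an elegant yet effective method that mitigates temporal inconsistency for robust text-guided video editing.
In addition to using a pretrained T2I 2D Unet for spatial content manipulation, the authors establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequence.
Furthermore, a cohesive spatial-temporal modeling unit is formulated to establish coherence and interrelation between the spatially focused and temporally focused components.
This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation.
Quantitative experiments and visualizations show that TCVE achieves state-of-the-art performance in both temporal consistency and video editing capability, surpassing existing benchmarks in the field.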

Summary (original)

Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field.

arXiv information

Authors	Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B. Chan, Zhen Cui
Published	2023-12-30 04:09:45+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
