VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Summary

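Diffusion-based generative models have recently achieved remarkable success in image generation and editing.
Their use for video editing, however, still faces important limitations.
This paper introduces VidEdit, a new method for zero-shot, text-based video editing that ensures strong temporal and spatial consistency.
First, we propose combining atlas-based methods with pre-trained text-to-image diffusion models to obtain a training-free, efficient editing method that achieves temporal smoothness by design.
Second, we leverage off-the-shelf panoptic segmenters and edge detectors, and adapt their use for conditioned diffusion-based atlas editing.
This gives fine spatial control over the targeted regions while strictly preserving the structure of the original video.
Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics.
With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt.
Project web page: https://videdit.github.io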
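
The atlas idea in the summary can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single atlas image, integer per-frame lookup coordinates, and a boolean region mask, and it replaces the diffusion edit with a constant placeholder image. All names (`render_frames`, `edit_atlas`, `uv`) are hypothetical. It only shows why editing one shared atlas inside a mask yields frames that agree with each other and stay untouched outside the target region.

```python
import numpy as np

def render_frames(atlas, uv):
    """Rebuild frames by nearest-neighbour lookup into a shared atlas.

    atlas: (H, W, 3) float array.
    uv:    (T, h, w, 2) integer array of (x, y) atlas coordinates per frame pixel.
    """
    return atlas[uv[..., 1], uv[..., 0]]

def edit_atlas(atlas, mask, edited):
    """Use `edited` inside the mask and keep the original atlas elsewhere."""
    m = mask[..., None].astype(atlas.dtype)
    return m * edited + (1.0 - m) * atlas

rng = np.random.default_rng(0)
atlas = rng.random((8, 8, 3))

# Toy mapping (assumption): two 4x4 frames, the second shifted one pixel right.
ys, xs = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
uv = np.stack([np.stack([xs + t, ys], axis=-1) for t in range(2)])

# Target region in atlas space; in the paper this would come from a segmenter.
mask = np.zeros((8, 8), dtype=bool)
mask[1:3, 2:4] = True

# Placeholder for the output of a conditioned diffusion model.
edited = np.ones_like(atlas)

new_atlas = edit_atlas(atlas, mask, edited)
frames_before = render_frames(atlas, uv)
frames_after = render_frames(new_atlas, uv)

# Which frame pixels look up a masked atlas location.
frame_mask = mask[uv[..., 1], uv[..., 0]]
```

Because every frame samples the same edited atlas, a point that appears in several frames receives the same edited value in each of them, and pixels mapped outside the mask are identical to the originals.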

Summary (original)

Recently, diffusion-based generative models have achieved remarkable success for image generation and edition. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing ensuring strong temporal and spatial consistency. Firstly, we propose to combine atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method, which by design fulfills temporal smoothness. Secondly, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures a fine spatial control on targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on DAVIS dataset, regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video only takes approximately one minute, and it can generate multiple compatible edits based on a unique text prompt. Project web-page at https://videdit.github.io

arXiv information

Authors	Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome
Published	2023-12-08 15:37:48+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
