FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Abstract

Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although spatio-temporal attention adds temporal context, it may introduce irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module of the diffusion model’s U-Net to address the inconsistency issue in text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels at maintaining visual consistency in the edited videos.
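
The core idea of the abstract, that patches on the same flow path attend only to each other, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the flow paths (here called `trajectories`) have already been computed by some optical-flow estimator, it uses the raw patch features as queries, keys, and values instead of learned projections, and it assumes each patch lies on at most one path. The names `flow_guided_attention`, `feats`, and `trajectories` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flow_guided_attention(feats, trajectories):
    """Attention restricted to patches on the same flow path (illustrative only).

    feats[t][n]        feature vector (list of floats) of patch n in frame t
    trajectories[k][t] index of the patch that path k passes through in frame t
                       (assumed precomputed from optical flow)
    """
    dim = len(feats[0][0])
    # Patches that lie on no path keep their original features.
    out = [[list(vec) for vec in frame] for frame in feats]
    for path in trajectories:
        # Gather one token per frame along this path.
        tokens = [feats[t][n] for t, n in enumerate(path)]
        for t, n in enumerate(path):
            query = tokens[t]
            # Scaled dot-product scores against every token on the same path.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in tokens
            ]
            weights = softmax(scores)
            # Weighted average of the path's tokens replaces the patch feature.
            out[t][n] = [
                sum(w * tok[d] for w, tok in zip(weights, tokens))
                for d in range(dim)
            ]
    return out


# Two frames, two patches per frame, 2-D features.
feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# One path: patch 0 in frame 0 moves to patch 1 in frame 1.
out = flow_guided_attention(feats, [[0, 1]])
```

In contrast to dense spatio-temporal attention, where every patch attends to every patch in every frame, each query here sees only as many keys as there are frames, which is how the restriction keeps unrelated patches out of the attention.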

arXiv information

Authors	Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He
Published	2024-02-29 21:06:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
