FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Summary

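Diffusion models have transformed image-to-image (I2I) synthesis and are now spreading to video.
However, progress in video-to-video (V2V) synthesis has been held back by the difficulty of maintaining temporal consistency across video frames.
This paper proposes a consistent V2V synthesis framework that jointly uses the spatial conditions and the temporal optical flow cues in the source video.
In contrast to prior methods that follow optical flow strictly, our approach exploits its benefits while tolerating imperfections in flow estimation.
We encode the optical flow by warping from the first frame and use the warped result as a supplementary reference for the diffusion model.
This lets the model synthesize video by editing the first frame with any prevalent I2I model and then propagating the edits to subsequent frames.
Our V2V model, FlowVid, demonstrates remarkable properties:
(1) Flexibility: FlowVid works seamlessly with existing I2I models, enabling a range of modifications including stylization, object swaps, and local edits.
(2) Efficiency: Generating a 4-second video at 30 FPS and 512×512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively.
(3) High quality: In user studies, FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).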
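
The warping step described in the summary can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which flow estimator, interpolation scheme, or occlusion handling FlowVid uses. The function name `warp_first_frame`, the backward-flow convention, and the nearest-neighbor sampling are all assumptions made for illustration, and plain Python lists stand in for image tensors.

```python
def warp_first_frame(first_frame, backward_flow, fill=0):
    """Warp `first_frame` toward a later frame using a dense flow field.

    first_frame: H x W grid (list of rows) of pixel values.
    backward_flow: H x W grid of (dx, dy) offsets. Assumed convention: for each
        pixel (x, y) of the later frame, (x + dx, y + dy) is the matching
        location in the first frame.
    Returns (warped, valid): the warped grid and a same-sized boolean grid that
    is False where the source location fell outside the image.
    """
    height, width = len(first_frame), len(first_frame[0])
    warped = [[fill] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = backward_flow[y][x]
            # Nearest-neighbor sampling (an assumption; bilinear is also common).
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = first_frame[src_y][src_x]
                valid[y][x] = True
    return warped, valid


# Toy example: the scene moved one pixel to the right between the two frames,
# so each pixel of the later frame looks one pixel to the left in the first frame.
frame = [[1, 2, 3],
         [4, 5, 6]]
flow = [[(-1, 0)] * 3 for _ in range(2)]
warped, valid = warp_first_frame(frame, flow)
```

Pixels marked invalid, or warped with an inaccurate flow, are the kind of imperfection the summary refers to. Using the warped frame only as a supplementary reference, rather than as the final output, leaves the diffusion model free to correct those regions.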

Summary (original)

Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and serve it as a supplementary reference in the diffusion model. This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generation of a 4-second video with 30 FPS and 512×512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).
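
The efficiency figures can be unpacked with a little arithmetic. The sketch below assumes that "N x faster" means the baseline's runtime is N times FlowVid's 1.5 minutes. The abstract does not report the baselines' absolute runtimes, so the implied values are derived estimates, not reported results.

```python
fps = 30
duration_s = 4
flowvid_minutes = 1.5
# Speedup factors as stated in the abstract.
speedups = {"CoDeF": 3.1, "Rerender": 7.2, "TokenFlow": 10.5}

# Number of frames in the generated clip.
num_frames = fps * duration_s  # 120

# Average generation time per frame for FlowVid.
seconds_per_frame = flowvid_minutes * 60 / num_frames  # 0.75

# Implied baseline runtimes in minutes, assuming runtime = speedup * FlowVid runtime.
implied_baseline_minutes = {
    name: round(flowvid_minutes * factor, 2) for name, factor in speedups.items()
}
```

Under that reading, the clip has 120 frames, FlowVid averages 0.75 seconds per frame, and the baselines would take roughly 4.7, 10.8, and 15.8 minutes for the same clip.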

arXiv information

Authors	Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu
Published	2023-12-29 16:57:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
