InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models

Summary

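We introduce InVi, an approach for inserting or replacing objects in videos (referred to as inpainting) using off-the-shelf text-to-image latent diffusion models.
Unlike existing video editing methods, which focus on comprehensive re-styling or changes to the entire scene, InVi aims at controlled manipulation of objects and at blending them seamlessly into the background video.
To achieve this goal, we tackle two key challenges.
First, to obtain high-quality control and blending, we adopt a two-step process of inpainting and matching.
This process begins by inserting the object into a single frame with a ControlNet-based inpainting diffusion model. It then generates the subsequent frames conditioned on features from that inpainted frame, which acts as an anchor, to minimize the domain gap between the background and the object.
Second, to ensure temporal coherence, we replace the diffusion model's self-attention layers with extended-attention layers.
The anchor frame features serve as the keys and values in these layers, improving consistency across frames.
Our approach removes the need for video-specific fine-tuning, providing an efficient and adaptable solution.
Experimental results show that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.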

Summary (original)

We introduce InVi, an approach for inserting or replacing objects within videos (referred to as inpainting) using off-the-shelf, text-to-image latent diffusion models. InVi targets controlled manipulation of objects and blending them seamlessly into a background video unlike existing video editing methods that focus on comprehensive re-styling or entire scene alterations. To achieve this goal, we tackle two key challenges. Firstly, for high quality control and blending, we employ a two-step process involving inpainting and matching. This process begins with inserting the object into a single frame using a ControlNet-based inpainting diffusion model, and then generating subsequent frames conditioned on features from an inpainted frame as an anchor to minimize the domain gap between the background and the object. Secondly, to ensure temporal coherence, we replace the diffusion model’s self-attention layers with extended-attention layers. The anchor frame features serve as the keys and values for these layers, enhancing consistency across frames. Our approach removes the need for video-specific fine-tuning, presenting an efficient and adaptable solution. Experimental results demonstrate that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
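The abstract states only that anchor-frame features serve as the keys and values of the extended-attention layers; it does not give the exact formulation. The NumPy sketch below assumes a common extended-attention variant: queries come from the frame being generated, while keys and values are computed from the concatenation of that frame's tokens and the anchor frame's tokens. It is single-head and unbatched, and the function names, projection matrices, and shapes are illustrative assumptions, not InVi's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extended_attention(frame_tokens, anchor_tokens, w_q, w_k, w_v):
    """Hypothetical single-head extended attention for one frame.

    frame_tokens:  (n, d) tokens of the frame currently being generated.
    anchor_tokens: (m, d) tokens of the inpainted anchor frame.
    w_q, w_k, w_v: (d, d_h) projection matrices (assumed, illustrative).
    Returns the attended features (n, d_h) and the attention weights (n, n + m).
    """
    # Queries come from the current frame only.
    q = frame_tokens @ w_q
    # Assumption: keys/values are drawn from the current frame AND the anchor frame,
    # so every generated frame can attend to the same anchor content.
    kv_src = np.concatenate([frame_tokens, anchor_tokens], axis=0)  # (n + m, d)
    k = kv_src @ w_k
    v = kv_src @ w_v
    # Scaled dot-product attention over all n + m key positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Because the anchor tokens are shared by every generated frame, each frame attends to the same reference content, which is the intuition the abstract gives for improved cross-frame consistency. Using only the anchor tokens as the key/value source (dropping the concatenation) would correspond to a stricter reading of the abstract; which variant InVi uses is not stated here.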

arXiv information

Authors	Nirat Saini, Navaneeth Bodla, Ashish Shrivastava, Avinash Ravichandran, Xiao Zhang, Abhinav Shrivastava, Bharat Singh
Published	2024-07-15 17:55:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
