Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Abstract

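Most existing video diffusion models (VDMs) are limited to text conditions alone, so they cannot control the visual appearance or geometric structure of the generated video. This work introduces Moonshot, a new video generation model conditioned simultaneously on multimodal image and text inputs. The model is built on a core module called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. Unlike prior methods, the model architecture is also carefully designed so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions without extra training overhead. Experiments show that, thanks to its versatile multimodal conditioning mechanisms, Moonshot significantly improves visual quality and temporal consistency compared with existing models. The model can also be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, which shows its potential to serve as a fundamental architecture for controllable video generation. Models will be made public at https://github.com/salesforce/LAVIS.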

Abstract (original)

Most existing video diffusion models (VDMs) are limited to mere text conditions. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without needing extra training overhead, as opposed to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
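The abstract names a decoupled cross-attention layer but does not spell out how it works. One plausible reading, sketched below in plain Python, is that the video latents produce a single set of queries, the text and image tokens each get their own key and value projections, and the two attention outputs are summed. This is a toy illustration under that assumption, not the authors' implementation: the function names, the `params` dictionary, the `image_scale` weight, and all sizes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = [
        softmax([scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k])
        for q_row in q
    ]
    return matmul(weights, v)


def decoupled_cross_attention(latents, text_tokens, image_tokens, params, image_scale=1.0):
    """Attend to text and image tokens separately, then sum the results.

    Assumption: queries are shared, while keys and values use separate
    projections per modality. `image_scale` is an illustrative knob only.
    """
    q = matmul(latents, params["w_q"])
    text_out = attention(
        q,
        matmul(text_tokens, params["w_k_text"]),
        matmul(text_tokens, params["w_v_text"]),
    )
    image_out = attention(
        q,
        matmul(image_tokens, params["w_k_img"]),
        matmul(image_tokens, params["w_v_img"]),
    )
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
latents = random_matrix(6, dim, rng)       # 6 spatial positions of one frame
text_tokens = random_matrix(3, dim, rng)   # 3 text-embedding tokens
image_tokens = random_matrix(2, dim, rng)  # 2 image-embedding tokens
params = {
    name: random_matrix(dim, dim, rng)
    for name in ("w_q", "w_k_text", "w_v_text", "w_k_img", "w_v_img")
}

out = decoupled_cross_attention(latents, text_tokens, image_tokens, params)
print(len(out), len(out[0]))  # 6 4
```

In this sketch, setting `image_scale=0.0` reduces the layer to ordinary text-only cross-attention, which is one reason such a design can be added to a text-conditioned model without disturbing its existing behavior.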

arXiv information

Authors	David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo
Published	2024-01-03 16:43:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
