LLM-grounded Video Diffusion Models

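Summary

Text-conditioned diffusion models have emerged as a promising tool for neural video generation.
However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (for example, they cannot even be prompted to generate objects moving from left to right).
To address these limitations, we introduce LLM-grounded Video Diffusion (LVD).
Instead of generating videos directly from the text input, LVD first uses a large language model (LLM) to generate dynamic scene layouts from the text input, and then uses the generated layouts to guide a diffusion model for video generation.
We show that LLMs can understand complex spatiotemporal dynamics from text alone and generate layouts that closely match both the prompts and the object motion patterns typically observed in the real world.
We then propose guiding video diffusion models with these layouts by adjusting the attention maps.
Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance.
Our results show that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.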

Summary (original)

Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
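
The two-stage idea in the abstract (a dynamic scene layout first, then layout-based guidance through attention maps) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the `Box` type, the hand-written linear interpolation standing in for an LLM-generated layout, and the `layout_energy` function, which simply measures how much attention mass falls outside a box. In a real system a differentiable energy of this kind would be minimized during sampling, in the manner of classifier guidance.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized [0, 1] image coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def interpolate_layout(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Return one box per frame, moving linearly from start to end.

    Stand-in for the LLM stage: the abstract says the LLM produces the
    per-frame layout; here it is hand-written for illustration only.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        boxes.append(Box(
            start.x0 + t * (end.x0 - start.x0),
            start.y0 + t * (end.y0 - start.y0),
            start.x1 + t * (end.x1 - start.x1),
            start.y1 + t * (end.y1 - start.y1),
        ))
    return boxes


def layout_energy(attention: List[List[float]], box: Box) -> float:
    """Fraction of attention mass lying outside the box (assumed energy).

    0.0 means all attention is inside the box; 1.0 means none is.
    """
    height, width = len(attention), len(attention[0])
    total = 0.0
    inside = 0.0
    for r in range(height):
        cy = (r + 0.5) / height  # normalized center of this row of cells
        for c in range(width):
            cx = (c + 0.5) / width  # normalized center of this column
            total += attention[r][c]
            if box.x0 <= cx <= box.x1 and box.y0 <= cy <= box.y1:
                inside += attention[r][c]
    if total == 0.0:
        return 1.0
    return 1.0 - inside / total


# Usage: an object prompted to move from left to right over four frames.
layout = interpolate_layout(Box(0.0, 0.25, 0.25, 0.75), Box(0.75, 0.25, 1.0, 0.75), 4)

# Toy 4x4 attention map with all mass on the left edge.
attention = [[0.0] * 4 for _ in range(4)]
attention[1][0] = 1.0
attention[2][0] = 1.0

for frame, box in enumerate(layout):
    print(frame, layout_energy(attention, box))
```

With this attention map the energy is lowest for the first frame's box, where the attention already sits inside the box, and rises for later frames, which is the signal guidance would use to shift attention toward the box as it moves right.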

arXiv information

Authors	Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
Published	2023-09-29 17:54:46+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
