Text-driven Video Prediction

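Summary

Current video generation models typically convert signals indicating appearance and motion, received from inputs (images, text, etc.) or latent spaces (noise vectors, etc.), into consecutive frames, following a stochastic generation process that accounts for the uncertainty introduced by sampling latent codes. However, this generation pattern lacks deterministic constraints on both appearance and motion, which leads to uncontrollable and undesirable results. We therefore propose a new task called Text-driven Video Prediction (TVP). The task takes the first frame and a text caption as input and aims to synthesize the frames that follow. Specifically, the appearance and motion components are provided separately by the image and the caption. The key to tackling TVP is to fully explore the latent motion information contained in the text description, which makes it easier to generate plausible videos. In fact, the task is inherently a cause-and-effect problem: the text content directly influences how motion changes across frames. To examine how well text supports causal inference of progressive motion information, our TVP framework includes a Text Inference Module (TIM) that produces step-wise embeddings to regulate motion inference for the subsequent frames. In particular, a refinement mechanism that incorporates global motion semantics ensures coherent generation. We conduct extensive experiments on the Something-Something V2 and Single Moving MNIST datasets. The results show that our model outperforms the other baselines, verifying the effectiveness of the proposed framework.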

Summary (original)

Current video generation models usually convert signals indicating appearance and motion received from inputs (e.g., image, text) or latent spaces (e.g., noise vectors) into consecutive frames, fulfilling a stochastic generation process for the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints for both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and text caption as inputs, this task aims to synthesize the following frames. Specifically, appearance and motion components are provided by the image and caption separately. The key to addressing the TVP task depends on fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this task is intrinsically a cause-and-effect problem, as the text content directly influences the motion changes of frames. To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM), producing step-wise embeddings to regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments are conducted on Something-Something V2 and Single Moving MNIST datasets. Experimental results demonstrate that our model achieves better results over other baselines, verifying the effectiveness of the proposed framework.
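The input/output contract of the TVP task (first frame plus caption in, following frames out, with motion regulated step by step) can be illustrated with a toy sketch. This is not the paper's model or its Text Inference Module: the names `DIRECTIONS`, `stepwise_motion`, `shift`, and `predict_video` are hypothetical, and the learned step-wise embeddings are replaced here by a fixed keyword lookup that yields one pixel displacement per future frame, loosely in the spirit of a single moving digit.

```python
from typing import List, Tuple

Frame = List[List[int]]  # a tiny grayscale frame as rows of pixel values

# Hypothetical caption vocabulary mapped to a per-step (row, column) displacement.
# A real system would learn this mapping; the lookup is an assumption for illustration.
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def stepwise_motion(caption: str, num_steps: int) -> List[Tuple[int, int]]:
    """Toy stand-in for step-wise text embeddings: one displacement per future frame."""
    dy = dx = 0
    for word in caption.lower().split():
        if word in DIRECTIONS:
            dy += DIRECTIONS[word][0]
            dx += DIRECTIONS[word][1]
    return [(dy, dx)] * num_steps


def shift(frame: Frame, dy: int, dx: int) -> Frame:
    """Translate a frame by (dy, dx), filling uncovered pixels with 0."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = frame[y][x]
    return out


def predict_video(first_frame: Frame, caption: str, num_steps: int) -> List[Frame]:
    """Return the frames that follow first_frame, generated one step at a time."""
    frames = []
    current = first_frame  # appearance comes from the image
    for dy, dx in stepwise_motion(caption, num_steps):  # motion comes from the text
        current = shift(current, dy, dx)
        frames.append(current)
    return frames
```

Calling `predict_video(first_frame, "the digit moves right", 2)` returns two frames, each derived from the previous one. The output is fully determined by the two inputs, which is the deterministic constraint the abstract contrasts with sampling from a latent code.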

arXiv information

Authors	Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang
Published	2022-10-06 12:43:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
