SSGVS: Semantic Scene Graph-to-Video Synthesis

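Summary

Video synthesis, a natural extension of the image synthesis task, has recently attracted considerable interest.
Many image synthesis methods use class labels or text as guidance.
However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends.
To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, because they represent the spatial and temporal relationships between objects in the scene.
Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts graph representations for unlabeled frames.
The VSG encoder is pre-trained with several contrastive multi-modal losses.
We propose a semantic scene graph-to-video synthesis framework (SSGVS), built on the pre-trained VSG encoder, a VQ-VAE, and an auto-regressive Transformer, which synthesizes a video from an initial scene image and a variable number of semantic scene graphs.
We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the benefit of video scene graphs for video synthesis.
The source code will be released.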
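To make the idea of temporally discrete scene graph annotations concrete, here is a minimal sketch. It is not the authors' code: the `Triplet` class, the helper functions, the frame indices, and the example relations are all hypothetical and chosen only for illustration. It shows how sparse per-frame (subject, predicate, object) triplets indicate roughly when a relation starts and ends, and which frames carry no annotation (the frames for which the VSG encoder would have to predict a graph representation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One scene graph edge: subject -[predicate]-> object (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical sparse annotations: frame index -> triplets labeled in that frame.
# Only a few frames are annotated, mirroring "temporally discrete" scene graphs.
annotations = {
    0: [Triplet("person", "standing_on", "floor")],
    8: [Triplet("person", "holding", "cup")],
    16: [Triplet("person", "holding", "cup"),
         Triplet("person", "drinking_from", "cup")],
    24: [Triplet("person", "standing_on", "floor")],
}


def relation_span(annotations, triplet):
    """Return (first, last) annotated frame containing the triplet, or None."""
    frames = sorted(f for f, ts in annotations.items() if triplet in ts)
    return (frames[0], frames[-1]) if frames else None


def unlabeled_frames(annotations, num_frames):
    """Return the frame indices in [0, num_frames) that have no annotation."""
    return [f for f in range(num_frames) if f not in annotations]


if __name__ == "__main__":
    holding = Triplet("person", "holding", "cup")
    print(relation_span(annotations, holding))
    print(unlabeled_frames(annotations, 10))
```

A single class label or caption for the whole clip would carry no such per-frame timing, which is the limitation the summary describes.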

Abstract (original)

As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code will be released.

arXiv information

Authors	Yuren Cong, Jinhui Yi, Bodo Rosenhahn, Michael Ying Yang
Published	2022-11-11 11:02:30+00:00
arXiv site	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
