Long-horizon video prediction using a dynamic latent hierarchy

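Abstract

Video prediction and generation are known to be very difficult, and research in this area has been largely limited to short-term prediction. However, the features of a video are organised in a spatiotemporal hierarchy, with different features having different temporal dynamics. In this paper, we introduce the Dynamic Latent Hierarchy (DLH), a deep hierarchical latent model that represents a video as a hierarchy of latent states evolving over separate, fluid timescales. Each latent state is a mixture distribution with two components, the immediate past and the predicted future, so the model learns transitions only between sufficiently dissimilar states while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modeling a video's temporal dynamics, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, represents stochasticity better, and can dynamically adjust its hierarchical and temporal structure. Among other things, the paper shows that progress in representation learning can translate into progress in prediction tasks.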

Abstract (original)

The task of video prediction and generation is known to be notoriously difficult, with the research in this area largely limited to short-term predictions. Though plagued with noise and stochasticity, videos consist of features that are organised in a spatiotemporal hierarchy, different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH) — a deep hierarchical latent model that represents videos as a hierarchy of latent states that evolve over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future, causing the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modeling temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, is able to better represent stochasticity, as well as to dynamically adjust its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.

arXiv information

Authors	Alexey Zakharov, Qinghai Guo, Zafeirios Fountas
Published	2023-01-09 18:08:09+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
