Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers


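This paper proposes the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and for fast inference.
The proposed transformer achieves linear time complexity in both encoding and decoding by projecting the observable context tokens into a fixed number of latent tokens and conditioning on those latents to decode the masked tokens through cross-attention.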


Autoregressive transformers have shown remarkable success in video generation. However, they cannot directly learn long-term dependencies in videos because of the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation due to the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and for fast inference. Building on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting the observable context tokens into a fixed number of latent tokens and conditioning on them to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over autoregressive transformers in both quality and speed when generating moderately long videos.
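The linear-complexity claim in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head dot-product attention with no learned projections, no layer stacking, and random vectors standing in for tokens, and every name in it (`cross_attention`, `latent_bottleneck_decode`, `random_tokens`) is hypothetical. It only shows the general pattern of routing context tokens through a fixed number of latent tokens and counts how many query-key scores are computed.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Dot-product attention where keys and values are the same vectors.

    Returns one output vector per query, plus the number of query-key
    scores computed (len(queries) * len(keys_values)).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs, len(queries) * len(keys_values)


def latent_bottleneck_decode(context, masked_queries, latents):
    """Assumed two-step pattern: latents read the context, queries read the latents."""
    # Encoding: a fixed number of latent tokens attends to the context tokens.
    encoded, enc_cost = cross_attention(latents, context)
    # Decoding: each masked-token query attends only to the encoded latents.
    decoded, dec_cost = cross_attention(masked_queries, encoded)
    return decoded, enc_cost + dec_cost


def random_tokens(rng, count, dim):
    """Random stand-ins for token embeddings (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim, num_latents = 8, 4
latents = random_tokens(rng, num_latents, dim)
for num_context, num_masked in [(16, 48), (32, 96)]:
    context = random_tokens(rng, num_context, dim)
    queries = random_tokens(rng, num_masked, dim)
    decoded, cost = latent_bottleneck_decode(context, queries, latents)
    # Full self-attention over all tokens would score every pair of tokens.
    full_cost = (num_context + num_masked) ** 2
    print(len(decoded), cost, full_cost)
```

With the number of latents held fixed, doubling the token count doubles the number of scores in this sketch (256 to 512), whereas full self-attention over the same tokens would quadruple it (4096 to 16384).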


Authors: Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong
Published: 2023-03-20 16:35:38+00:00


Category: cs.CV