Matten: Video Generation with Mamba-Attention

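Abstract

This paper introduces Matten, a cutting-edge latent diffusion model for video generation built on a Mamba-Attention architecture.
At minimal computational cost, Matten uses spatial-temporal attention to model local video content and bidirectional Mamba to model global video content.
Our comprehensive experimental evaluation shows that Matten is competitive with current Transformer-based and GAN-based models on benchmarks, achieving superior FVD scores and efficiency.
We also observe a direct positive correlation between the complexity of the designed model and the improvement in video quality, which indicates that Matten scales well.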

Abstract (original)

In this paper, we introduce Matten, a cutting-edge latent diffusion model with Mamba-Attention architecture for video generation. With minimal computational cost, Matten employs spatial-temporal attention for local video content modeling and bidirectional Mamba for global video content modeling. Our comprehensive experimental evaluation demonstrates that Matten has competitive performance with the current Transformer-based and GAN-based models in benchmark performance, achieving superior FVD scores and efficiency. Additionally, we observe a direct positive correlation between the complexity of our designed model and the improvement in video quality, indicating the excellent scalability of Matten.

arXiv information

Authors	Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, Lin Ma
Published	2024-05-10 08:30:07+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
