Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Abstract

It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: https://epiphqny.github.io/Loong-video.
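
As a rough illustration of what a loss re-weighting scheme can look like, the sketch below computes a weighted mean of per-token losses with one weight per video frame. The abstract does not state how Loong assigns its weights, so the function name `reweighted_loss`, its arguments, and the example weights are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def reweighted_loss(
    token_losses: Sequence[float],
    tokens_per_frame: int,
    frame_weights: Sequence[float],
) -> float:
    """Weighted mean of per-token losses, one weight per frame.

    Hypothetical helper for illustration only; not the paper's formula.
    token_losses is a flat list ordered frame by frame.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    if len(token_losses) != tokens_per_frame * len(frame_weights):
        raise ValueError("expected tokens_per_frame losses for every frame")

    weighted_sum = 0.0
    weight_total = 0.0
    for i, loss in enumerate(token_losses):
        # Every token inherits the weight of the frame it belongs to.
        w = frame_weights[i // tokens_per_frame]
        weighted_sum += w * loss
        weight_total += w
    if weight_total == 0:
        raise ValueError("frame_weights must not sum to zero")
    return weighted_sum / weight_total


# Toy data (made up): 3 frames, 2 tokens per frame.
losses = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0]
uniform = reweighted_loss(losses, 2, [1.0, 1.0, 1.0])     # plain average
emphasized = reweighted_loss(losses, 2, [2.0, 1.0, 1.0])  # first frame counts double
```

With uniform weights this reduces to the ordinary mean, so frames that contribute many tokens with similar loss values dominate the average. Changing the per-frame weights shifts how much each part of the video contributes to the training signal.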

arXiv information

Authors	Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu
Published	2024-10-03 17:59:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
