One-Minute Video Generation with Test-Time Training

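Summary

Today's Transformers still struggle to generate one-minute videos because self-attention layers are inefficient for long contexts.
Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive.
We experiment with Test-Time Training (TTT) layers, whose hidden states can themselves be neural networks and are therefore more expressive.
Adding TTT layers to a pre-trained Transformer enables it to generate one-minute videos from text storyboards.
As a proof of concept, we curate a dataset based on Tom and Jerry cartoons.
Compared with baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method.
Although promising, the results still contain artifacts, likely due to the limited capability of the pre-trained 5B model.
The efficiency of the implementation can also be improved.
We have only experimented with one-minute videos because of resource constraints, but the approach can be extended to longer videos and more complex stories.
Sample videos, code, and annotations are available at https://test-time-training.github.io/video-dit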
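
The idea that a layer's hidden state can itself be a model may be easier to see in a toy sketch. The code below is an illustration under simplifying assumptions, not the authors' implementation: the hidden state is the weight matrix W of a plain linear model (the abstract only says the hidden state can be a neural network), and the self-supervised task, reconstructing each token from a copy with its second half zeroed, is a hypothetical choice made for this example. The function name, learning rate, and masking scheme are all invented here.

```python
def ttt_linear_layer(tokens, lr=0.1):
    """Toy TTT-style layer whose hidden state is a weight matrix W.

    For each token x, W takes one gradient step on the reconstruction
    loss ||W @ view(x) - x||^2, then the layer emits W @ x.
    """
    dim = len(tokens[0])
    keep = dim // 2  # hypothetical corruption: zero out the second half
    W = [[0.0] * dim for _ in range(dim)]  # hidden state = model weights
    outputs, losses = [], []
    for x in tokens:
        x_view = [v if i < keep else 0.0 for i, v in enumerate(x)]
        # Prediction and loss under the current hidden state.
        pred = [sum(W[r][c] * x_view[c] for c in range(dim)) for r in range(dim)]
        err = [p - v for p, v in zip(pred, x)]
        losses.append(sum(e * e for e in err))
        # "Training at test time": one gradient step updates the hidden state.
        for r in range(dim):
            for c in range(dim):
                W[r][c] -= lr * 2.0 * err[r] * x_view[c]
        # Output computed with the updated hidden state.
        outputs.append([sum(W[r][c] * x[c] for c in range(dim)) for r in range(dim)])
    return outputs, W, losses


outputs, W, losses = ttt_linear_layer([[1.0, 1.0, 1.0, 1.0]] * 5)
print(losses)
```

On a sequence that repeats the same token, the recorded loss shrinks at every step, which shows the hidden state adapting to the sequence as it is processed.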

Summary (original)

Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit
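
To put the 34-point Elo lead in perspective, a rating gap can be converted into an expected head-to-head preference rate. The sketch below assumes the standard logistic Elo model with a scale of 400; the abstract does not say how the ratings were computed, so this is only a rough reading aid. The function name is hypothetical.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected score of the higher-rated side under the standard Elo model.

    Assumes the conventional base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 34-point gap under this assumption.
print(round(elo_expected_score(34), 3))  # prints 0.549
```

Under that assumption, a 34-point lead corresponds to being preferred in roughly 55% of pairwise comparisons.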

arXiv information

Authors	Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang
Published	2025-04-07 17:56:31+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
