VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

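Summary

The quality of video-text pairs fundamentally determines the upper bound of text-to-video models.
The datasets currently used to train these models have serious shortcomings, including poor temporal consistency, low-quality captions, substandard video quality, and imbalanced data distribution.
The prevailing video curation process relies on image models for tagging and on manual rule-based curation, which incurs a high computational load and still leaves unclean data behind.
As a result, appropriate training datasets for text-to-video models are lacking.
To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models.
The dataset is produced through a coarse-to-fine curation strategy, which guarantees high-quality videos and detailed captions with excellent temporal consistency.
Using this dataset to train a video generation model yielded experimental results that surpass those obtained with other models.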

Summary (original)

The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.

arXiv information

Authors	Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li
Published	2024-08-05 16:53:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
