T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

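Summary

Text-to-video (T2V) generative models have advanced significantly, but their ability to compose different objects, attributes, actions, and motions into a single video remains unexplored.
Previous text-to-video benchmarks also neglect this ability in their evaluation.
This work conducts the first systematic study of compositional text-to-video generation.
We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
T2V-CompBench covers diverse aspects of compositionality: consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy.
We also carefully design three kinds of evaluation metrics (multimodal large language model (MLLM)-based, detection-based, and tracking-based) that better reflect compositional text-to-video generation quality across the seven proposed categories, which together contain 1,400 text prompts.
The effectiveness of the proposed metrics is verified by their correlation with human evaluations.
We also benchmark various text-to-video generative models and analyze the results in depth across models and compositional categories.
We find that compositional text-to-video generation is highly challenging for current models, and we hope this work sheds light on future research in this direction.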
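
The summary states that the proposed metrics are validated by their correlation with human evaluations. As a rough illustration of what such a check can look like, the sketch below computes a rank correlation between automatic scores and human ratings. The abstract does not say which coefficient is used, so Spearman's rank correlation is an assumption here, and the function names and all score values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: one automatic metric score and one human rating per video.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.15]
human_ratings = [5, 2, 4, 4, 1]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient near 1 would indicate that the automatic metric orders videos much as human raters do; a rank-based coefficient is a natural choice when human ratings are ordinal and contain ties.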

Abstract (original)

Text-to-video (T2V) generative models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks also neglect this important ability for evaluation. In this work, we conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design evaluation metrics of multimodal large language model (MLLM)-based, detection-based, and tracking-based metrics, which can better reflect the compositional text-to-video generation quality of seven proposed categories with 1400 text prompts. The effectiveness of the proposed metrics is verified by correlation with human evaluations. We also benchmark various text-to-video generative models and conduct in-depth analysis across different models and various compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope our attempt could shed light on future research in this direction.

arXiv information

Authors	Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
Published	2025-01-15 18:57:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
