ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

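Summary

Compositionality is an important capability of Text-to-Image (T2I) models because it reflects their ability to understand and combine multiple concepts from a text description.
Existing evaluations of compositional ability depend heavily on human-designed text prompts or fixed templates, which limits their diversity and complexity and gives them low discriminative power.
The authors propose ConceptMix, a scalable, controllable, and customizable benchmark that automatically evaluates the compositional generation ability of T2I models.
The evaluation proceeds in two stages.
First, ConceptMix generates the text prompts: using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT4-o to write an image-generation prompt based on the sampled concepts.
Second, ConceptMix evaluates the images generated from these prompts: it generates one question per visual concept and has a strong VLM answer them, checking how many of the k concepts actually appear in the image.
Applying ConceptMix to a diverse set of T2I models (proprietary and open) with increasing values of k shows that it has higher discriminative power than earlier benchmarks.
In particular, ConceptMix reveals that the performance of several models, especially open models, drops dramatically as k increases.
It also provides insight into the lack of prompt diversity in widely used training datasets.
In addition, the authors conduct extensive human studies to validate the design of ConceptMix and to compare its automatic grading with human judgment.
They hope the benchmark will guide future T2I model development.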
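The first stage described above (sample an object plus k visual concepts, then hand them to a language model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the concept pools, the function names `sample_concepts` and `build_llm_instruction`, and the choice to draw each concept from a distinct category are all assumptions made for the example, and the call to the language model itself is left out.

```python
import random

# Hypothetical concept pools for illustration only; the benchmark's real
# categories and values are not listed in the abstract.
OBJECTS = ["cat", "teapot", "bicycle", "lamp"]
CONCEPT_CATEGORIES = {
    "color": ["red", "blue", "green"],
    "shape": ["round", "square", "triangular"],
    "texture": ["fluffy", "metallic", "wooden"],
    "spatial": ["to the left of a tree", "on top of a table"],
}


def sample_concepts(k, rng):
    """Sample one object and k visual concepts.

    Assumption: each concept comes from a different category, so k cannot
    exceed the number of categories.
    """
    if not 1 <= k <= len(CONCEPT_CATEGORIES):
        raise ValueError("k must be between 1 and the number of categories")
    obj = rng.choice(OBJECTS)
    categories = rng.sample(sorted(CONCEPT_CATEGORIES), k)
    concepts = [(cat, rng.choice(CONCEPT_CATEGORIES[cat])) for cat in categories]
    return obj, concepts


def build_llm_instruction(obj, concepts):
    """Format the sampled concepts as an instruction for a language model."""
    lines = [f"- {category}: {value}" for category, value in concepts]
    return (
        f"Write one natural image-generation prompt about a {obj} "
        "that includes every concept below:\n" + "\n".join(lines)
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    obj, concepts = sample_concepts(3, rng)
    print(build_llm_instruction(obj, concepts))
```

Raising k in this sketch makes each prompt combine more concepts, which is the sense in which the benchmark's difficulty is controllable.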

Abstract (original)

Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT4-o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the k concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of k, we show that our ConceptMix has higher discrimination power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased k. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.
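The second stage in the abstract (one question per visual concept, answered by a VLM, then a count of the concepts that appear) might look roughly like the sketch below. Everything here is hypothetical: `grade_image` and `keyword_vlm` are illustrative names, the "image" is stood in for by a set of tags, and the keyword matcher only replaces a real VLM so the example can run. The all-concepts-present flag is an added convenience; the abstract only states that the number of satisfied concepts is checked.

```python
from typing import Any, Callable, Sequence, Tuple


def grade_image(
    image: Any,
    questions: Sequence[str],
    vlm_answers_yes: Callable[[Any, str], bool],
) -> Tuple[int, bool]:
    """Ask one yes/no question per concept and count the 'yes' answers.

    Returns (number of concepts judged present, whether all were present).
    """
    answers = [bool(vlm_answers_yes(image, q)) for q in questions]
    num_present = sum(answers)
    return num_present, num_present == len(questions)


def keyword_vlm(image_tags: set, question: str) -> bool:
    """Stand-in for a VLM: answers 'yes' if any question word is a tag.

    A real grader would send the image and the question to a
    vision-language model instead.
    """
    words = question.lower().rstrip("?").split()
    return any(word in image_tags for word in words)


if __name__ == "__main__":
    questions = ["Is the teapot red?", "Is the teapot round?"]
    # The fake "image" is red but square, so only the first concept is met.
    print(grade_image({"red", "square"}, questions, keyword_vlm))  # (1, False)
```

Passing the answering function as a parameter keeps the counting logic separate from whichever VLM is used, which is convenient when comparing automatic grades with human answers to the same questions.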

arXiv information

Authors	Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, Sanjeev Arora
Published	2024-08-26 15:08:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
