Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Summary

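Text-to-image (T2I) generative models have become ubiquitous, but they do not necessarily generate images that align with a given prompt.
Previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, but the quality of these components has not been systematically measured.
Human-rated prompt sets are generally small, and the reliability of the ratings, and therefore of the prompt set used to compare models, is not evaluated.
We address this gap with an extensive study evaluating auto-eval metrics and human templates.
We provide three main contributions. (1) We introduce a comprehensive skills-based benchmark that can discriminate between models across different human templates.
This skills-based benchmark categorises prompts into sub-skills, so a practitioner can pinpoint not only which skills are challenging but also at what level of complexity a skill becomes challenging.
(2) We gather human ratings across four templates and four T2I models, for a total of more than 100K annotations.
This lets us understand where differences arise from inherent ambiguity in the prompt and where they arise from differences in metric and model quality.
(3) Finally, we introduce a new QA-based auto-eval metric that correlates better with human ratings than existing metrics on our new dataset, across different human templates, and on TIFA160.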

Abstract (original)

While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings — and thereby the prompt set used to compare models — is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
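The abstract's third contribution rests on how well an automatic metric correlates with human ratings. The following is a minimal sketch of one common way to measure that, assuming Spearman rank correlation as the measure (the abstract does not say which correlation is used). The function names and the score lists are hypothetical and are not taken from the paper or its dataset.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one human alignment rating and one metric score
# per (prompt, image) pair. These numbers are illustrative only.
human_ratings = [0.9, 0.4, 0.7, 0.1, 0.6]
metric_scores = [0.8, 0.5, 0.6, 0.2, 0.7]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Under this assumption, comparing two metrics means computing the coefficient for each against the same human ratings and preferring the one with the higher value.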

arXiv information

Authors	Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Pinelopi Papalampidi, Ira Ktena, Chris Knutsen, Cyrus Rashtchian, Anant Nawalgaria, Jordi Pont-Tuset, Aida Nematzadeh
Published	2025-03-17 15:53:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
