CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Summary

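The growing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts.
This work presents the first study to systematically quantify how well T2I models and evaluation metrics align with both explicit and implicit cultural expectations.
To this end, the authors introduce CulturalFrames, a new benchmark designed for rigorous human evaluation of cultural representation in visual generations.
Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3,637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations.
T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit ones.
Across models and countries, cultural expectations are missed 44% of the time on average.
Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, and implicit expectation failures are also significant, averaging 49%.
Furthermore, existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning.
Together, the findings expose critical gaps and provide actionable directions for developing more culturally informed T2I models and evaluation methods.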

Abstract (original)

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.

arXiv information

Authors	Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal
Published	2025-06-10 14:21:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
