Benchmarking Vision Language Models for Cultural Understanding

Summary

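Foundation models and vision-language pre-training have notably advanced vision language models (VLMs), enabling multimodal processing of visual and linguistic data.
However, VLM performance is typically evaluated on general scene understanding, that is, recognizing objects, attributes, and actions, rather than on cultural understanding.
This study introduces CulturalVQA, a visual question-answering benchmark designed to assess the geo-diverse cultural understanding of VLMs.
The authors curate a collection of 2,378 image-question pairs representing cultures from 11 countries across 5 continents, with 1 to 5 answers per question.
The questions probe understanding of various facets of culture, such as clothing, food, drinks, rituals, and traditions.
Benchmarking VLMs, including GPT-4V and Gemini, on CulturalVQA reveals disparities in cultural understanding across regions: the models show strong cultural understanding for North America but significantly lower performance for Africa.
Performance also differs across cultural facets, with clothing, rituals, and traditions scoring higher than food and drink.
These disparities help identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.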
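The summary describes a benchmark in which each question carries 1 to 5 reference answers and results are compared across regions and cultural facets. The following is a minimal sketch of how such a per-group breakdown could be computed. The `Example` schema, the field names, and the exact-match scoring rule are all assumptions made for illustration; the summary does not specify the dataset format or the scoring metric used in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    """One benchmark item (hypothetical schema, not the dataset's real format)."""
    country: str
    facet: str            # e.g. "food", "clothing", "rituals"
    question: str
    answers: list[str]    # 1 to 5 reference answers


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not count as errors.
    return " ".join(text.lower().split())


def is_correct(prediction: str, answers: list[str]) -> bool:
    # Simplifying assumption: a prediction is correct if it exactly matches
    # any one of the reference answers after normalization.
    return normalize(prediction) in {normalize(a) for a in answers}


def accuracy_by(examples, predictions, key):
    """Group examples by `key` ("country" or "facet") and return accuracy per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex, pred in zip(examples, predictions):
        group = getattr(ex, key)
        totals[group] += 1
        hits[group] += is_correct(pred, ex.answers)
    return {group: hits[group] / totals[group] for group in totals}


# Toy, made-up items and predictions; these are not taken from the benchmark.
examples = [
    Example("country_a", "food", "What is this dish called?", ["Injera"]),
    Example("country_a", "clothing", "What is this garment called?", ["sari", "saree"]),
    Example("country_b", "food", "What is this dish called?", ["poutine"]),
]
predictions = ["injera", "kimono", "Poutine "]

by_country = accuracy_by(examples, predictions, "country")
by_facet = accuracy_by(examples, predictions, "facet")
```

Calling `accuracy_by` once with `"country"` and once with `"facet"` gives the two kinds of breakdown the summary refers to. Exact matching is a deliberately strict stand-in; open-ended answers would likely need a more lenient comparison in practice.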

Summary (original)

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has been typically assessed on general scene understanding – recognizing objects, attributes, and actions – rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLM’s geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America while significantly lower performance for Africa. We observe disparity in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performances than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.

arXiv information

Authors	Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Stańczak, Aishwarya Agrawal
Published	2024-07-15 17:21:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
