Can MLLMs Understand the Deep Implication Behind Chinese Images?

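Summary

As the capabilities of multimodal large language models (MLLMs) continue to improve, the need to evaluate their higher-order capabilities is growing.
However, little work has evaluated MLLMs on higher-order perception and understanding of Chinese visual content.
To fill this gap, we introduce the **C**hinese **I**mage **I**mplication Understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images.
CII-Bench stands out from existing benchmarks in several ways.
First, to ensure the authenticity of the Chinese context, the images in CII-Bench are sourced from the Chinese Internet and manually reviewed, and the corresponding answers are also written by hand.
In addition, CII-Bench includes images that represent traditional Chinese culture, such as famous traditional Chinese paintings, which can deeply reflect a model's understanding of that culture.
Extensive experiments on CII-Bench across multiple MLLMs yield several significant findings.
First, there is a substantial gap between the performance of MLLMs and that of humans on CII-Bench.
The highest MLLM accuracy is 64.4%, whereas human accuracy averages 78.2% and peaks at 81.0%.
Second, MLLMs perform worse on images of traditional Chinese culture, which suggests that their ability to understand high-level semantics is limited and that they lack a deep knowledge base of traditional Chinese culture.
Finally, most models become more accurate when hints about the image's emotion are included in the prompt.
We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey toward expert artificial general intelligence (AGI).
Our project is publicly available at https://cii-bench.github.io/.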

Abstract (original)

As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model’s understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on CII-Bench. The highest accuracy of MLLMs attains 64.4%, whereas human accuracy averages 78.2%, peaking at an impressive 81.0%. Subsequently, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of a deep knowledge base of Chinese traditional culture. Finally, it is observed that most models exhibit enhanced accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io/.

arXiv information

Authors	Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
Published	2024-10-17 17:59:24+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
