Exploring Perceptual Limitation of Multimodal Large Language Models

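Abstract

Multimodal large language models (MLLMs) have recently shown remarkable perceptual ability when answering visual questions, but little is known about the limits of that perception.
In particular, prior work has offered anecdotal evidence that MLLMs are sensitive to object size, but this phenomenon and its underlying causes have not been investigated comprehensively.
In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images.
We then identify four independent factors that may contribute to this limitation (object quality, size, distractors, and location) and conduct controlled intervention studies to measure the effect of each factor on MLLM perception.
In particular, we find that lower object quality and smaller object size can each independently reduce an MLLM's ability to answer visual questions.
More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLM question-answering accuracy.
Our study provides a better understanding of the perceptual limits of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs.
To facilitate further investigation, we release our code and data.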

Abstract (original)

Multimodal Large Language Models (MLLMs) have recently shown remarkable perceptual capability in answering visual questions, however, little is known about the limits of their perception. In particular, while prior works have provided anecdotal evidence of MLLMs’ sensitivity to object size, this phenomenon and its underlying causes have not been explored comprehensively. In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images. Next, we identify four independent factors that can contribute to this limitation — object quality, size, distractors, and location — and conduct controlled intervention studies to measure the effect of each factor on MLLMs’ perception. In particular, we find that lower object quality and smaller object size can both independently reduce MLLMs’ ability to answer visual questions. More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLMs’ question answering accuracy. Our study provides a better understanding of the perceptual limitation of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs. To facilitate further investigations, we release our code and data.

arXiv information

Authors	Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun
Published	2024-02-12 03:04:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
