Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models

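Abstract

Multimodal large language models (LLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA), a fundamental task that affects a variety of downstream applications and domains.
Because these models are likely to be used broadly, it is important to investigate their limitations in handling different image and question properties.
In this work, we investigate whether multimodal LLMs can perceive small details in images as well as they perceive large ones.
In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, dropping by up to $46\%$ depending on size.
Furthermore, we show that this effect is causal by observing that human visual cropping significantly reduces the models' sensitivity to size.
Inspired by the usefulness of human cropping, we propose three automatic visual cropping methods as inference-time mechanisms for improving the zero-shot performance of multimodal LLMs.
We study their effectiveness on four popular VQA datasets and on a subset of the VQAv2 dataset tailored towards fine visual details.
Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction for improving their zero-shot performance.
Our code and data are publicly available.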

Abstract (original)

Multimodal Large Language Models (LLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) — a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to $46\%$ with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. Inspired by the usefulness of human cropping, we then propose three automatic visual cropping methods as inference time mechanisms to improve the zero-shot performance of multimodal LLMs. We study their effectiveness on four popular VQA datasets, and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance. Our code and data are publicly available.

arXiv information

Authors	Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Published	2023-10-24 17:48:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
