Why are Visually-Grounded Language Models Bad at Image Classification?

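Summary

Image classification is one of the most fundamental capabilities of machine vision intelligence.
This work revisits the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA.
The authors find that existing proprietary and public VLMs, although they often use CLIP as a vision encoder and have many more parameters, significantly underperform CLIP on standard image classification benchmarks such as ImageNet.
To understand why, they explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs.
Their analysis shows that the primary cause is data-related: the information critical for image classification is encoded in the VLM's latent space, but it can only be decoded effectively with enough training data.
Specifically, there is a strong correlation between how often a class is seen during VLM training and instruction-tuning and the VLM's performance on that class.
When trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models.
Based on these findings, the authors enhance a VLM by integrating classification-focused datasets into its training, and show that the improved classification performance transfers to the model's general capabilities, yielding an 11.8% improvement on the newly collected ImageWikiQA dataset.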

Abstract (original)

Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM’s latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM’s performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset.

arXiv information

Authors	Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy
Published	2024-05-28 17:57:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
