Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models

Summary

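Recent progress in vision-language models (VLMs) has brought remarkable success in recognizing visual semantic content, including impressive examples of compositional image understanding.
Here we introduce a new task, visual data-type identification: a basic perceptual skill with implications for data curation (for example, removing noisy data from large datasets, or domain-specific retrieval) and for autonomous vision (for example, distinguishing changing weather conditions from camera lens staining).
We develop two datasets of animal images altered across a diverse set of 27 visual data-types spanning four broad categories.
An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, reveals a nuanced performance landscape.
VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, but they struggle with simpler data-types that arise from basic manipulations such as image rotation or additive noise.
Our findings show that (i) model scaling alone yields only marginal gains for contrastively trained models like CLIP, and (ii) performance drops markedly for the largest auto-regressively trained VLMs like OpenFlamingo.
This points to a blind spot in current frontier VLMs: they excel at recognizing semantic content but do not acquire an understanding of visual data-types through scaling.
By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant improvement in performance.
By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs and equipping them with visual data-type understanding.
Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.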

Summary (original)

Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.
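
The abstract does not spell out how the zero-shot evaluation is scored, so the following is only a minimal sketch of one common approach for contrastively trained models like CLIP: embed the image, embed one text prompt per data-type, and pick the data-type whose prompt is most similar. The function names and the three-dimensional vectors are hypothetical stand-ins for real encoder outputs; they are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify_data_type(image_embedding, prompt_embeddings):
    """Return the data-type whose prompt embedding is closest to the image,
    together with the similarity score of every candidate."""
    scores = {
        name: cosine(image_embedding, emb)
        for name, emb in prompt_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-made vectors (assumption): a real setup would obtain these
# from a VLM's text encoder and image encoder.
prompt_embeddings = {
    "cartoon": [0.9, 0.1, 0.0],
    "sketch": [0.1, 0.9, 0.0],
    "rotated": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

predicted, scores = identify_data_type(image_embedding, prompt_embeddings)
print(predicted)
```

In this framing, the reported difficulty with rotations or additive noise would correspond to the image embedding carrying little information that separates those prompts, regardless of how large the encoder is.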

arXiv information

Authors	Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge
Published	2023-12-06 12:34:16+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
