No ‘Zero-Shot’ Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

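Summary

Web-crawled pretraining datasets underpin the impressive ‘zero-shot’ evaluation performance of multimodal models such as CLIP for classification and retrieval and Stable Diffusion for image generation.
However, because it is not known how far these pretraining datasets cover the downstream concepts targeted in ‘zero-shot’ evaluation, it is unclear how meaningful the notion of ‘zero-shot’ generalization is for such models.
In this work, we ask how the performance of multimodal models on downstream concepts is influenced by the frequency of those concepts in their pretraining datasets.
We investigate this question comprehensively across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts.
We consistently find that, far from exhibiting ‘zero-shot’ generalization, multimodal models need exponentially more data to achieve linear improvements in downstream ‘zero-shot’ performance, following a sample-inefficient log-linear scaling trend.
This trend persists even when controlling for sample-level similarity between the pretraining and downstream datasets, and when testing on purely synthetic data distributions.
Furthermore, when we benchmark models on long-tailed data sampled on the basis of our analysis, multimodal models perform poorly across the board.
We contribute this long-tail test set as the ‘Let it Wag!’ benchmark to support further research in this direction.
Taken together, our study reveals an exponential need for training data, which implies that the key to ‘zero-shot’ generalization under large-scale training paradigms has yet to be found.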

Abstract (original)

Web-crawled pretraining datasets underlie the impressive ‘zero-shot’ evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of ‘zero-shot’ generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during ‘zero-shot’ evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting ‘zero-shot’ generalization, multimodal models require exponentially more data to achieve linear improvements in downstream ‘zero-shot’ performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the ‘Let it Wag!’ benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to ‘zero-shot’ generalization capabilities under large-scale training paradigms remains to be found.
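The log-linear trend described in the abstract (linear gains in performance for exponential growth in concept frequency) can be illustrated with a minimal sketch. The frequencies and accuracies below are made-up illustrative numbers, not results from the paper, and `fit_log_linear` and `frequency_needed` are hypothetical helper names introduced only for this example.

```python
import math


def fit_log_linear(frequencies, scores):
    """Ordinary least-squares fit of score ~ a * log10(frequency) + b."""
    xs = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    a = sxy / sxx              # gain in score per 10x increase in frequency
    b = mean_y - a * mean_x    # extrapolated score at frequency 1
    return a, b


def frequency_needed(target_score, a, b):
    """Invert the fitted line: concept frequency needed for a target score."""
    return 10 ** ((target_score - b) / a)


# Illustrative (assumed) data: each 10x increase in how often a concept
# appears in pretraining adds the same fixed amount of accuracy.
freqs = [10, 100, 1_000, 10_000, 100_000]
accs = [0.10, 0.20, 0.30, 0.40, 0.50]

a, b = fit_log_linear(freqs, accs)
print(f"slope per decade: {a:.3f}, intercept: {b:.3f}")
print(f"frequency needed for 0.60 accuracy: {frequency_needed(0.60, a, b):.0f}")
```

Under this toy fit, every additional fixed step in accuracy costs another tenfold increase in concept frequency, which is the sense in which the scaling is sample-inefficient.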

arXiv information

Authors	Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
Published	2024-10-29 14:00:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
