Image Captioners Are Scalable Vision Learners Too

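Summary

Contrastive pretraining on image-text pairs collected from the web is among the most widely used large-scale pretraining strategies for vision backbones, particularly for large multimodal models.
Meanwhile, image captioning on the same kind of data is generally regarded as an inferior pretraining strategy.
This paper compares the two pretraining strategies fairly, carefully matching training data, compute, and model capacity.
With a standard encoder-decoder transformer, captioning alone turns out to be surprisingly effective: on classification tasks it yields vision encoders competitive with contrastively pretrained ones, and on vision-and-language tasks it surpasses them.
The paper also analyzes how model architecture, model scale, and pretraining data affect representation quality, and finds that captioning scales as well as or better than contrastive pretraining along these axes.
Overall, the results show that plain image captioning is a stronger pretraining strategy than previously believed.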

Summary (original)

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.

arXiv information

Authors	Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
Published	2023-11-09 10:39:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
