CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a \$10,000 Budget; An Extra \$4,000 Unlocks 81.8% Accuracy

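Summary

The recent work CLIPA presents an inverse scaling law for CLIP training: the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training.
This finding makes it possible to train high-performance CLIP models with significantly reduced computation.
Building on this work, we present CLIPA-v2, which makes two key contributions.
Technically, we find that this inverse scaling law also applies in the finetuning stage, further reducing computational needs.
Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training.
Our results are exciting: with a budget of only \$10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% while reducing the computational cost by ~39X.
Moreover, with an additional investment of \$4,000, we can raise the zero-shot ImageNet accuracy further to 81.8%.
Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.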
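
To make "shorter sequence length of image tokens" concrete, here is a minimal sketch (not the authors' code) of how the number of patch tokens in a ViT with patch size 14, as in H/14, shrinks when the input resolution is reduced. The function name `image_tokens` and the resolutions in the loop are assumptions chosen for illustration; the summary above does not state which resolutions or token-reduction strategies CLIPA-v2 uses.

```python
def image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch tokens a ViT produces for a square image (class token excluded)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size  # patches along one edge
    return side * side


if __name__ == "__main__":
    baseline = image_tokens(224)
    # Hypothetical resolutions, picked only to illustrate the trend.
    for res in (224, 112, 84, 70):
        n = image_tokens(res)
        print(f"{res}x{res}: {n} tokens ({baseline / n:.1f}x fewer than at 224x224)")
```

Because transformer compute grows at least linearly with the number of tokens, a shorter token sequence translates directly into cheaper training steps, which is the lever the inverse scaling law relies on.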

Summary (original)

The recent work CLIPA presents an inverse scaling law for CLIP training: the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computations. Building upon this work, we hereby present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting: by only allocating a budget of \$10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing the computational cost by ~39X. Moreover, with an additional investment of \$4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.

arXiv information

Authors	Xianhang Li, Zeyu Wang, Cihang Xie
Published	2023-06-27 17:51:06+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
