Improved baselines for vision-language pre-training

Summary

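Contrastive learning has emerged as an efficient framework for learning multimodal representations.
CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data with a contrastive loss.
Recent work claims improvements over CLIP by adding non-contrastive losses inspired by self-supervised learning.
However, it can be hard to disentangle the contribution of these additional losses from other implementation details used to train the model, such as data augmentation and regularization techniques.
To shed light on this issue, the paper first proposes, implements, and evaluates several baselines obtained by combining contrastive learning with recent advances in self-supervised learning.
In particular, the baselines use loss functions that have proven successful in visual self-supervised learning to align the image and text modalities.
These baselines outperform a basic implementation of CLIP.
However, the advantage disappears once a stronger training recipe is employed.
Indeed, a simple CLIP baseline can also be improved substantially, by up to 25% in relative terms on downstream zero-shot tasks, by using well-known training techniques that are popular in other subfields.
Moreover, applying image and text augmentations alone is enough to make up for most of the improvement attained by prior work.
With the improved training recipe, CLIP achieves state-of-the-art performance on four standard datasets and consistently outperforms prior work (by up to +4% on the largest dataset) while being substantially simpler.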

Abstract (original)

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work claims improvements over CLIP using additional non-contrastive losses inspired from self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from other implementation details, e.g., data augmentation or regularization techniques, used to train the model. To shed light on this matter, in this paper, we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use the loss functions that were proven successful for visual self-supervised learning to align image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, up to a 25% relative improvement on downstream zero-shot tasks, by using well-known training techniques that are popular in other subfields. Moreover, we discover that it is enough to apply image and text augmentations to make up for most of the improvement attained by prior works. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets, and consistently outperform prior work (up to +4% on the largest dataset), while being substantially simpler.
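
The contrastive loss that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain-Python version of a symmetric InfoNCE-style objective of the kind commonly associated with CLIP, written for readability. The function names, the fixed `temperature=0.07` default, and the toy embeddings are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumes image_embs[i] and text_embs[i] belong to the same pair, so the
    matching entries sit on the diagonal of the similarity matrix.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each row is a classification over the texts.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: each column is a classification over the images.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch of two pairs (hypothetical 2-d embeddings).
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned` is close to zero because every image is most similar to its own caption, while `swapped` is large because each image is most similar to the other pair's caption. The additional non-contrastive losses and augmentations discussed in the abstract would be applied on top of an objective of this kind.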

arXiv information

Authors	Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal
Published	2023-05-15 14:31:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
