Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

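Summary

Vision-language models trained with contrastive learning on large-scale noisy data are increasingly common for zero-shot recognition problems. In this paper we improve three aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective. First, we propose a simple filtering strategy named Complexity, Action, and Text-spotting (CAT), which substantially reduces dataset size while improving performance across zero-shot vision-language tasks. Next, we propose an approach called Concept Distillation, which leverages strong unimodal representations in contrastive training. Finally, we modify the conventional contrastive alignment objective and propose an importance-sampling approach that up-weights hard negatives without adding complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves over the baseline on 20 tasks. In addition, for few-shot linear probing, we propose a new approach that bridges the gap between zero-shot and few-shot performance and substantially improves over prior work. Models are available at https://github.com/facebookresearch/diht.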

Abstract (original)

Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training pipeline: dataset noise, model initialization and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size, while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training that does not increase training complexity while outperforming prior work. Finally, we modify the traditional contrastive alignment objective, and propose an importance-sampling approach to up-sample the importance of hard-negatives without adding additional complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at https://github.com/facebookresearch/diht.
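
The abstract does not give the form of the modified objective, so the following is only an illustrative sketch of how importance weights could up-weight hard negatives in a contrastive loss. The function name, the weighting scheme (weights proportional to exp(beta * similarity / tau), normalized to average 1), and the parameters `tau` and `beta` are all assumptions made for illustration and should not be read as the authors' formulation. Only the image-to-text direction is shown.

```python
import math

def hard_negative_contrastive_loss(sim, tau=0.1, beta=0.5):
    """Image-to-text contrastive loss with importance-weighted negatives.

    Hypothetical sketch: sim[i][j] is the similarity between image i and
    text j, with matching pairs on the diagonal. Setting beta = 0 gives
    every negative a weight of 1, which is the standard InfoNCE loss.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        negatives = [j for j in range(n) if j != i]
        # Assumed weighting: negatives more similar to image i get larger weights.
        raw = [math.exp(beta * sim[i][j] / tau) for j in negatives]
        norm = sum(raw)
        weights = [(n - 1) * r / norm for r in raw]  # weights average to 1
        pos = math.exp(sim[i][i] / tau)
        neg = sum(w * math.exp(sim[i][j] / tau)
                  for w, j in zip(weights, negatives))
        total += -math.log(pos / (pos + neg))
    return total / n

# Toy similarity matrix: row 0 has one hard negative (0.8) and one easy one (0.1).
sim = [[0.9, 0.8, 0.1],
       [0.2, 0.7, 0.3],
       [0.1, 0.4, 0.6]]
plain = hard_negative_contrastive_loss(sim, beta=0.0)
weighted = hard_negative_contrastive_loss(sim, beta=1.0)
```

In this sketch the re-weighting changes only the scalar weights inside the existing sum over negatives, so it needs no extra forward passes, which is one way to read the abstract's "without adding additional complexity". Because the weights grow with similarity and average to 1, the weighted loss is never smaller than the unweighted one.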

arXiv information

Authors	Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan
Published	2023-01-05 19:48:01+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
