GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training

Abstract

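Most existing vision and language pre-training (VLP) methods focus on how to extract and align vision and text features. In contrast, we highlight that two steps routinely applied during pre-training have a crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning a large masking probability for masked language modeling (MLM). After empirically showing the unexpected effectiveness of these two steps, we systematically devise GRIT-VLP, which adaptively samples mini-batches so that hard negative samples for ITM are mined more effectively while the computational cost of pre-training is maintained. Our method consists of three components: 1) a GRouped mIni-baTch sampling (GRIT) strategy that collects similar examples in a mini-batch, 2) an ITC consistency loss for improving the mining ability, and 3) an enlarged masking probability for MLM. Consequently, we show that GRIT-VLP achieves new state-of-the-art performance on various downstream tasks with much less computational cost. Furthermore, we demonstrate that our model is essentially on par with ALBEF, the previous state of the art, with only one-third of the training epochs on the same training data. Code is available at https://github.com/jaeseokbyun/GRIT-VLP.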
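
As a rough illustration of what in-batch hard negative sampling for ITM can look like, the sketch below draws, for each image in a mini-batch, one non-matching text with probability proportional to the softmax of its similarity score. The function name `sample_hard_negatives`, the plain nested-list similarity matrix, and the convention that row i and column i form the matching pair are assumptions made for this example; the abstract does not specify these details.

```python
import math
import random


def sample_hard_negatives(sim, rng):
    """Pick one hard negative text index per image (hypothetical helper).

    sim[i][j] is an assumed image-to-text similarity score; the matching
    pair for image i is text i. Texts that look more similar to the image
    are sampled more often, which is what makes the negatives "hard".
    """
    n = len(sim)
    negatives = []
    for i in range(n):
        # Subtract the largest non-matching score for numerical stability.
        row_max = max(sim[i][j] for j in range(n) if j != i)
        # The matching text gets weight 0 so it can never be drawn.
        weights = [
            0.0 if j == i else math.exp(sim[i][j] - row_max)
            for j in range(n)
        ]
        negatives.append(rng.choices(range(n), weights=weights, k=1)[0])
    return negatives


if __name__ == "__main__":
    scores = [
        [9.0, 2.0, 0.5],
        [1.0, 8.0, 3.0],
        [0.2, 4.0, 7.0],
    ]
    print(sample_hard_negatives(scores, random.Random(0)))
```

Because the negatives come only from the current mini-batch, their difficulty depends on how similar the examples in that batch are to one another, which is the motivation the abstract gives for grouping similar examples together.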

Abstract (original)

Most of the currently existing vision and language pre-training (VLP) methods have mainly focused on how to extract and align vision and text features. In contrast to the mainstream VLP methods, we highlight that two routinely applied steps during pre-training have crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning the large masking probability for the masked language modeling (MLM). After empirically showing the unexpected effectiveness of above two steps, we systematically devise our GRIT-VLP, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM while maintaining the computational cost for pre-training. Our method consists of three components: 1) GRouped mIni-baTch sampling (GRIT) strategy that collects similar examples in a mini-batch, 2) ITC consistency loss for improving the mining ability, and 3) enlarged masking probability for MLM. Consequently, we show our GRIT-VLP achieves a new state-of-the-art performance on various downstream tasks with much less computational cost. Furthermore, we demonstrate that our model is essentially in par with ALBEF, the previous state-of-the-art, only with one-third of training epochs on the same training data. Code is available at https://github.com/jaeseokbyun/GRIT-VLP.
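
The idea of collecting similar examples in a mini-batch can be pictured with a small greedy sketch: starting from a random example, keep appending the most similar unvisited example, then cut the resulting order into mini-batches and shuffle the batches. This is a simplified illustration only, not the procedure from the paper or its repository; the name `group_into_batches` and the precomputed pairwise similarity matrix are assumptions.

```python
import random


def group_into_batches(sim, batch_size, rng):
    """Greedily chain similar examples, then chunk the chain into batches.

    sim[i][j] is an assumed pairwise similarity between examples i and j.
    """
    n = len(sim)
    unvisited = set(range(n))
    current = rng.randrange(n)
    unvisited.remove(current)
    order = [current]
    while unvisited:
        # Move to the unvisited example most similar to the current one.
        current = max(unvisited, key=lambda j: sim[order[-1]][j])
        unvisited.remove(current)
        order.append(current)
    # Neighbouring examples in the chain end up in the same mini-batch.
    batches = [order[k:k + batch_size] for k in range(0, n, batch_size)]
    # Shuffle at the batch level so training order is still randomized.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    pairwise = [
        [1.0, 0.9, 0.1, 0.2],
        [0.9, 1.0, 0.1, 0.2],
        [0.1, 0.1, 1.0, 0.9],
        [0.2, 0.2, 0.9, 1.0],
    ]
    print(group_into_batches(pairwise, 2, random.Random(0)))
```

With the toy matrix above, examples 0 and 1 land in one batch and examples 2 and 3 in the other, so each batch contains mutually similar examples from which harder negatives can be drawn.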

arXiv information

Authors	Jaeseok Byun, Taebaek Hwang, Jianlong Fu, Taesup Moon
Published	2022-08-08 11:15:45+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
