Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency

Abstract

We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method that improves the efficiency of vision-language models (VLMs). Unlike MetaCLIP, our method does not need metadata for pruning; instead, it selects the text-image pairs to prune based on the content of the text. Specifically, WFPP prunes text-image pairs whose text contains words that occur with high frequency across the entire training dataset. The effect of WFPP is to reduce the dominance of frequent words. The result is a better-balanced word-frequency distribution in the dataset, which is known to improve the training of word embedding models. After pre-training on the pruned subset, we fine-tune the model on the entire dataset for one additional epoch to achieve better performance. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks. WFPP also speeds up pre-training because it uses fewer samples. Additionally, we analyze the training data before and after pruning to visualize how WFPP changes the balance of word frequencies. We hope our work encourages researchers to consider the distribution of words in the training data when pre-training VLMs, not only CLIP.
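The abstract does not spell out how a caption's high-frequency words are turned into a pruning decision, so the following is only a minimal sketch under stated assumptions. It borrows the word2vec subsampling formula, where a word with relative frequency f(w) gets discard probability 1 - sqrt(t / f(w)), averages that value over the words of each caption, and drops the highest-scoring pairs. The function names, the whitespace tokenizer, the averaging step, the threshold `t`, and `keep_ratio` are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def word_frequencies(captions):
    """Relative frequency of each lowercased whitespace token over all captions."""
    counts = Counter(w for c in captions for w in c.lower().split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def caption_score(caption, freqs, t=1e-5):
    """Mean word2vec-style discard probability of the caption's words.

    Assumption: a caption dominated by frequent words should score high.
    Words with f(w) <= t contribute 0.
    """
    words = caption.lower().split()
    if not words:
        return 0.0
    probs = [max(0.0, 1.0 - math.sqrt(t / freqs[w])) for w in words]
    return sum(probs) / len(probs)


def prune_pairs(pairs, keep_ratio=0.5, t=1e-5):
    """Keep the keep_ratio fraction of (image, caption) pairs with the lowest scores.

    The surviving pairs are returned in their original order.
    """
    pairs = list(pairs)
    freqs = word_frequencies(caption for _, caption in pairs)
    scores = [caption_score(caption, freqs, t) for _, caption in pairs]
    n_keep = int(len(pairs) * keep_ratio)
    # Stable sort: ties are broken by original position.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i])
    kept = sorted(ranked[:n_keep])
    return [pairs[i] for i in kept]


if __name__ == "__main__":
    toy = [
        ("a.jpg", "a photo of a dog"),
        ("b.jpg", "a photo of a cat"),
        ("c.jpg", "a photo of a photo"),
        ("d.jpg", "rare okapi grazing quietly"),
    ]
    for image, caption in prune_pairs(toy, keep_ratio=0.75, t=0.1):
        print(image, caption)
```

In this toy run the caption made up entirely of the corpus's most frequent words is the one that gets pruned. Under the schedule described in the abstract, a model would be pre-trained on the pruned subset and then fine-tuned on the full dataset for one additional epoch.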

arXiv information

Authors	Mingliang Liang, Martha Larson
Published	2024-12-10 13:00:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
