Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Summary

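Recent work on large-scale contrastive pretraining for text embeddings shows that building minibatches from a single source, rather than mixing sources, can substantially improve overall model accuracy.
This work explores extending training-data stratification beyond source granularity: a pretrained text embedding model and the classic k-means clustering algorithm are used to further split the training data by semantic cluster within each source.
Experimentally, a notable increase in NDCG@10 is observed when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset.
The clustering approach is also connected conceptually to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology, and the paper discusses how this unified view motivates future lines of research on the organization of contrastive pretraining data.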
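
For readers unfamiliar with the reported metric, the following is a minimal sketch of NDCG@10 for a single query. It assumes the linear-gain form of DCG (which coincides with the exponential-gain form for binary relevance labels); the function names and arguments are illustrative, and the paper's exact evaluation tooling is not stated on this page.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: labels of the retrieved passages, in ranked order.
    all_relevances: labels of every judged passage for the query, used to
    build the ideal ranking (assumed input, not taken from the paper).
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant passage, retrieved at rank 2: 1 / log2(3), roughly 0.631
print(round(ndcg_at_k([0, 1, 0], [1]), 3))
```

Scores are averaged over all evaluation queries to obtain the dataset-level figure.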

Abstract (original)

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
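
The stratification idea in the abstract can be sketched roughly as follows: group examples by source, run k-means on their embeddings within each source, and draw every minibatch from a single (source, cluster) stratum. This is a dependency-free toy under stated assumptions, not the author's implementation: the `kmeans` and `clustered_minibatches` helpers, the `(source, embedding, payload)` tuple layout, and all parameter values are hypothetical, and in practice the embeddings would come from a pretrained text embedding model and a library k-means.

```python
import random
from collections import defaultdict


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def clustered_minibatches(examples, k, batch_size, seed=0):
    """Return minibatches whose examples share both a source and a cluster.

    `examples` is a list of (source, embedding, payload) tuples.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for source, emb, payload in examples:
        by_source[source].append((emb, payload))

    batches = []
    for source, items in by_source.items():
        # Cluster each source separately, so strata never cross sources.
        labels = kmeans([emb for emb, _ in items], min(k, len(items)), seed=seed)
        strata = defaultdict(list)
        for (_, payload), label in zip(items, labels):
            strata[label].append(payload)
        for label, payloads in strata.items():
            rng.shuffle(payloads)
            for i in range(0, len(payloads), batch_size):
                batches.append((source, label, payloads[i:i + batch_size]))
    rng.shuffle(batches)  # randomize batch order across sources and clusters
    return batches
```

With in-batch negatives, each query in such a batch is contrasted against passages from the same semantic cluster, which is plausibly where the link to topic-aware sampling and hard-negative mining comes from.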

arXiv information

Author	Luke Merrick
Published	2024-07-26 17:36:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
