Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Summary

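Similarity between training examples is used in many ways to curate pretraining datasets for language models, for example to diversify the data or to select examples similar to high-quality data. However, similarity is usually measured with off-the-shelf embedding models that are either generic or trained for tasks such as retrieval. This paper introduces a framework for analyzing how suitable embedding models are for data curation specifically in the language model pretraining setting. It quantifies how well similarity in the embedding space correlates with similarity in pretraining loss between different training examples, and how diversifying in the embedding space affects pretraining quality. A variety of embedding models are analyzed within the framework, with experiments that pretrain a 1.7B-parameter decoder-only language model on the Pile dataset. All of the embedding models considered turn out to be useful for pretraining data curation. Moreover, the simple approach of averaging per-token embeddings proves surprisingly competitive with more sophisticated embedding models, likely because the latter are not designed specifically for pretraining data curation. The authors believe that the analysis and evaluation framework can serve as a foundation for designing embedding models that reason specifically about similarity in pretraining datasets.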
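To make the two ideas in the summary concrete, here is a minimal Python sketch, not the authors' implementation. It averages per-token embeddings into one vector per document, computes pairwise cosine similarity, and measures how well that similarity correlates with a second pairwise similarity matrix. The function names, the toy random data, and the use of Pearson correlation are all assumptions made for illustration; the abstract does not say how loss similarity is defined, so it is passed in here as an opaque matrix.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average per-token embeddings (num_tokens x dim) into a single vector."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of an (n x dim) array."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def pairwise_correlation(sim_a, sim_b):
    """Pearson correlation over the distinct pairs (upper triangle) of two
    similarity matrices. Pearson is an assumption, not the paper's stated metric."""
    iu = np.triu_indices_from(sim_a, k=1)
    return float(np.corrcoef(sim_a[iu], sim_b[iu])[0, 1])


# Toy data (hypothetical): four "documents" with different token counts.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(n, 8)) for n in (5, 7, 3, 6)]
doc_vectors = np.stack([mean_pool(d) for d in docs])
embedding_sim = cosine_similarity_matrix(doc_vectors)

# Placeholder for a loss-based similarity matrix; in practice this would come
# from pretraining losses, which this sketch does not model.
loss_sim = rng.uniform(size=(4, 4))
loss_sim = (loss_sim + loss_sim.T) / 2

print(pairwise_correlation(embedding_sim, loss_sim))
```

With random placeholder data the printed correlation carries no meaning; the sketch only shows the shape of the comparison between an embedding-based similarity and a loss-based one.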

Summary (original)

Similarity between training examples is used to curate pretraining datasets for language models by many methods — for diversification and to select examples similar to high-quality data. However, similarity is typically measured with off-the-shelf embedding models that are generic or trained for tasks such as retrieval. This paper introduces a framework to analyze the suitability of embedding models specifically for data curation in the language model pretraining setting. We quantify the correlation between similarity in the embedding space to similarity in pretraining loss between different training examples, and how diversifying in the embedding space affects pretraining quality. We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model. We find that the embedding models we consider are all useful for pretraining data curation. Moreover, a simple approach of averaging per-token embeddings proves to be surprisingly competitive with more sophisticated embedding models — likely because the latter are not designed specifically for pretraining data curation. Indeed, we believe our analysis and evaluation framework can serve as a foundation for the design of embedding models that specifically reason about similarity in pretraining datasets.

arXiv information

Authors	Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar
Publication date	2025-02-04 17:09:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
