Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

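Summary

We introduce~\textsc{Domain2Vec}, a new approach that decomposes any dataset into a linear combination of several \emph{meta-domains}, a new concept designed to capture the key underlying features of datasets.
\textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a classifier to decompose a given dataset into a domain vector that corresponds to a distribution over this vocabulary.
These domain vectors make it possible to identify the optimal data mixture for language model (LM) pretraining without any training, under the \emph{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that the better the data distributions of the training set and the validation set are aligned, the lower the validation loss.
In addition, \textsc{Domain2Vec} can be integrated seamlessly into previous works to model the relationship between domain vectors and LM performance, greatly improving the efficiency and scalability of those methods.
Extensive experiments show that \textsc{Domain2Vec} helps find data mixtures that improve downstream task performance with minimal computational overhead.
Specifically, \textsc{Domain2Vec} reaches the same validation loss on Pile-CC using only $51.5\%$ of the computation required when training on the original mixture of The Pile dataset.
Under an equivalent compute budget, \textsc{Domain2Vec} improves downstream performance by an average of $2.83\%$.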
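
The alignment idea can be illustrated with a small sketch. It is not the authors' implementation: the function name `align_mixture`, the toy domain vectors, the squared Euclidean distance, and the exponentiated-gradient optimizer are all assumptions chosen for illustration. Given one domain vector per training dataset and the domain vector of a validation set, it searches for mixture weights whose blended domain vector matches the validation vector as closely as possible, without training any model.

```python
import numpy as np


def align_mixture(domain_vectors, target_vector, steps=2000, lr=0.5):
    """Hypothetical helper: find mixture weights on the probability simplex
    so that the weighted sum of domain vectors is close to target_vector."""
    V = np.asarray(domain_vectors, dtype=float)  # (n_datasets, n_meta_domains)
    t = np.asarray(target_vector, dtype=float)   # (n_meta_domains,)
    w = np.full(V.shape[0], 1.0 / V.shape[0])    # start from the uniform mixture
    for _ in range(steps):
        residual = w @ V - t                     # mismatch of the blended vector
        grad = 2.0 * V @ residual                # gradient of ||w V - t||^2 w.r.t. w
        w = w * np.exp(-lr * grad)               # exponentiated-gradient step
        w /= w.sum()                             # renormalise onto the simplex
    return w


# Toy data (assumed): three training datasets over three meta-domains.
train_vectors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])
# Validation vector built as the 0.5 / 0.3 / 0.2 blend of the rows above.
validation_vector = np.array([0.42, 0.38, 0.20])

weights = align_mixture(train_vectors, validation_vector)
print(weights)
```

Because the validation vector here was constructed as a known blend of the training vectors, the recovered weights should land close to that blend. With real domain vectors an exact match usually does not exist, and the choice of distance measure becomes a modelling decision.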

Abstract (original)

We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any dataset into a linear combination of several \emph{meta-domains}, a new concept designed to capture the key underlying features of datasets. \textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the \emph{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2vec} can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that \textsc{Domain2Vec} helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, \textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only $51.5\%$ of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, \textsc{Domain2Vec} improves downstream performance by an average of $2.83\%$.

arXiv information

Authors	Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
Published	2025-06-12 17:53:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
