Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

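Summary

Data selection is crucial when pre-training large language models, because quality varies widely within the large-scale training corpora that are available.
To address this, researchers are investigating data influence as a measure of the importance of a data instance: a high influence score indicates that adding the instance to the training set is likely to improve model performance.
The top-$k$ instances with the highest scores are then selected.
However, this approach has several limitations.
(1) Computing the influence of all available data is time-consuming.
(2) The selected instances are not diverse enough, which may hinder the pre-trained model's ability to generalize to a variety of downstream tasks.
This paper introduces \texttt{Quad}, a data selection approach that uses data influence to account for both quality and diversity, achieving state-of-the-art pre-training results.
In particular, noting that attention layers capture extensive semantic detail, the authors adapt accelerated $iHVP$ (inverse Hessian-vector product) computation methods to attention layers, which improves the ability to evaluate the influence of data, i.e., its quality.
To ensure diversity, \texttt{Quad} clusters the dataset so that instances within a cluster are similar and instances in different clusters are diverse.
When a cluster is chosen for selection, only a sample of its instances is evaluated for influence, which avoids processing every instance.
To decide which clusters to select, the method uses the classic Multi-Armed Bandit approach, treating each cluster as an arm.
This favors clusters that contain highly influential instances (ensuring high quality) or that have been selected less often (ensuring diversity), and so balances quality against diversity.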
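
The cluster-as-arm idea described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a UCB1-style score (mean sampled influence plus an exploration bonus), and the function name `ucb_select_clusters`, the parameters `sample_size`, `threshold`, and `c`, and the precomputed influence lists are all hypothetical stand-ins chosen so that the example runs on its own.

```python
import math
import random


def ucb_select_clusters(cluster_influences, rounds, sample_size, threshold,
                        c=1.0, seed=0):
    """Select instances from clusters with a UCB1-style bandit (illustrative).

    cluster_influences[i][j] stands in for the influence score of instance j
    in cluster i. A real pipeline would compute these scores on demand; they
    are precomputed here only to keep the sketch self-contained.
    Every cluster is assumed to be non-empty.
    """
    rng = random.Random(seed)
    n_arms = len(cluster_influences)
    pulls = [0] * n_arms          # how often each cluster has been chosen
    mean_reward = [0.0] * n_arms  # running mean of sampled influence per cluster
    selected = set()              # (cluster, instance) pairs kept for training

    for t in range(1, rounds + 1):
        def score(i):
            # Unvisited clusters are tried first.
            if pulls[i] == 0:
                return math.inf
            # Exploitation term (quality) + exploration bonus (diversity).
            return mean_reward[i] + c * math.sqrt(2.0 * math.log(t) / pulls[i])

        arm = max(range(n_arms), key=score)

        # Evaluate influence on a small sample instead of the whole cluster.
        members = cluster_influences[arm]
        k = min(sample_size, len(members))
        sampled = rng.sample(range(len(members)), k)
        scores = [members[j] for j in sampled]

        # Keep sampled instances whose influence exceeds the threshold.
        selected.update((arm, j) for j, s in zip(sampled, scores) if s > threshold)

        # Update the running mean reward of the chosen cluster.
        reward = sum(scores) / k
        pulls[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]

    return sorted(selected), pulls


if __name__ == "__main__":
    clusters = [[0.9] * 20, [0.1] * 20, [0.5] * 20]
    chosen, pulls = ucb_select_clusters(clusters, rounds=30, sample_size=4,
                                        threshold=0.3, c=0.5)
    print("pulls per cluster:", pulls)
    print("instances selected:", len(chosen))
```

In this toy setup the high-influence cluster is visited most often, while the exploration bonus still sends occasional pulls to the rarely chosen clusters, which is the quality and diversity trade-off the summary describes.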

Abstract (original)

Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, i.e., a high influence score indicates that incorporating this instance into the training set is likely to enhance the model performance. Consequently, they select the top-$k$ instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model’s ability to generalize effectively to various downstream tasks. In this paper, we introduce \texttt{Quad}, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pre-training results. In particular, noting that attention layers capture extensive semantic details, we have adapted the accelerated $iHVP$ computation methods for attention layers, enhancing our ability to evaluate the influence of data, i.e., its quality. For the diversity, \texttt{Quad} clusters the dataset into similar data instances within each cluster and diverse instances across different clusters. For each cluster, if we opt to select data from it, we take some samples to evaluate the influence to prevent processing all instances. To determine which clusters to select, we utilize the classic Multi-Armed Bandit method, treating each cluster as an arm. This approach favors clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby balancing quality and diversity well.
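
The abstract refers to influence scores and to $iHVP$ without stating formulas. As background only, the commonly used influence-function form is sketched below. It is an assumption that \texttt{Quad} follows this convention; the symbols $\theta$ (model parameters), $L$ (loss), $D_{\mathrm{train}}$ (training set), $D_r$ (a reference set), and $s(z)$ (the score of instance $z$) are introduced here for illustration and do not come from the source.

\[
H = \frac{1}{|D_{\mathrm{train}}|} \sum_{z' \in D_{\mathrm{train}}} \nabla_\theta^2 L(z', \theta),
\qquad
\mathrm{iHVP}(v) = H^{-1} v,
\]
\[
s(z) = \nabla_\theta L(D_r, \theta)^{\top} \, H^{-1} \, \nabla_\theta L(z, \theta)
     = \mathrm{iHVP}\bigl(\nabla_\theta L(D_r, \theta)\bigr)^{\top} \nabla_\theta L(z, \theta).
\]

The second equality uses the symmetry of $H$. Under the usual first-order approximation, up-weighting $z$ by a small amount changes the reference loss by roughly $-s(z)$ times that amount, so a larger $s(z)$ corresponds to a larger expected reduction in loss. The expensive part is the product with $H^{-1}$, which is why accelerated $iHVP$ methods matter: the vector $\mathrm{iHVP}(\nabla_\theta L(D_r, \theta))$ can be computed once and reused for every candidate $z$.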

arXiv information

Authors	Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ye Yuan, Guoren Wang, Conghui He
Published	2024-09-25 14:49:29+00:00

Source and services used

arxiv.jp, Google
