Data Selection via Optimal Control for Language Models

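Summary

This work investigates how to select high-quality pre-training data from massive corpora to enhance the capabilities of LMs for downstream use.
We formulate data selection as a generalized optimal control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics.
Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions.
In our experiments, we use PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates LM learning and consistently improves performance on a wide range of downstream tasks across various model sizes.
Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by extrapolating the test loss curves according to the Scaling Laws.
PDS also improves data utilization when pre-training data is limited, reducing data demand by a factor of 1.8, which mitigates the rapid exhaustion of available web-crawled corpora.
The code, data, and model checkpoints are available at https://github.com/microsoft/LMOps/tree/main/data_selection.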

Summary (original)

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs’ capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin’s Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which mitigates the quick exhaustion of available web-crawled corpora. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.

arXiv information

Authors	Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang
Published	2024-10-09 17:06:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
