INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Large Language Models

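Summary

A salient characteristic of large pre-trained language models (PTLMs) is that their generalization ability improves markedly, and new capabilities emerge, as model capacity and pre-training dataset size grow. As a result, we are seeing the development of enormous models that push the state of the art. This inevitably brings very long training times, very high computing costs, and an environmental burden. Considerable effort is going into making PTLM training more efficient through innovations in model architecture, training pipelines, and loss function design, but little attention has been paid to optimizing the utility of the training data. This work asks whether downstream performance can be maintained when PTLMs are trained on only the highly informative subsets of the training data. Building on recent progress in informative data subset selection, the paper shows how submodular optimization can be used to select highly representative subsets of the training corpus. The results show that the proposed framework can efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data while retaining up to $\sim99\%$ of the performance of the fully trained models.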

Summary (original)

A salient characteristic of large pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance? Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora. Our results demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of data while retaining up to $\sim99\%$ of the performance of the fully-trained models.
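To make the idea of selecting a representative subset by submodular optimization concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say which submodular function is used, so the facility location objective, the cosine similarity, and the names `cosine_similarity` and `greedy_facility_location` are all assumptions chosen for illustration. The greedy loop is the standard heuristic for maximizing a monotone submodular function under a size budget.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_facility_location(features, budget):
    """Greedily pick up to `budget` indices that maximize a facility location objective.

    Assumed objective (illustrative, not taken from the abstract):
        f(S) = sum_i max(0, max_{j in S} sim(i, j))
    i.e. every sample should have a similar representative in the chosen subset.
    """
    n = len(features)
    sim = [[cosine_similarity(features[i], features[j]) for j in range(n)]
           for i in range(n)]
    best_cover = [0.0] * n  # best similarity of each sample to the current subset
    selected = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves coverage of every sample.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical toy features: two samples near the x-axis, two near the y-axis.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_facility_location(toy_features, budget=2))
```

With a budget of two, the sketch picks one sample from each of the two clusters, which is the sense in which such a subset is "representative". This version builds a full pairwise similarity matrix, so it only illustrates the objective; a real pre-training corpus would need partitioning or approximation to be tractable.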

arXiv information

Authors	H S V N S Kowndinya Renduchintala,Krishnateja Killamsetty,Sumit Bhatia,Milan Aggarwal,Ganesh Ramakrishnan,Rishabh Iyer,Balaji Krishnamurthy
Published	2023-05-11 09:24:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
