Towards Effective and Efficient Continual Pre-training of Large Language Models

Summary

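Continual pre-training (CPT) is an important approach for adapting language models to specific domains or tasks.
To make the CPT approach more traceable, this paper presents a technical report on continually pre-training Llama-3 (8B), which significantly enhances the backbone model's Chinese language ability and scientific reasoning ability.
To enhance the new abilities while retaining the original ones, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets.
Specifically, we synthesize multidisciplinary scientific question-and-answer (QA) pairs based on related web pages, and then incorporate this synthetic data to improve the scientific reasoning ability of Llama-3.
We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3).
We also present tuning experiments with a relatively small model, TinyLlama, and apply the derived findings to training the backbone model.
Extensive experiments on a number of evaluation benchmarks show that our approach substantially improves the performance of the backbone model, covering both general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting its original capabilities.
Our model, data, and code are available at https://github.com/RUC-GSAI/Llama-3-SynE.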

Abstract (original)

Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original abilities, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets. Specifically, we synthesize multidisciplinary scientific question and answer (QA) pairs based on related web pages, and subsequently incorporate these synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present the tuning experiments with a relatively small model — TinyLlama, and employ the derived findings to train the backbone model. Extensive experiments on a number of evaluation benchmarks show that our approach can largely improve the performance of the backbone models, including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting the original capacities. Our model, data, and codes are available at https://github.com/RUC-GSAI/Llama-3-SynE.

arXiv information

Authors	Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen
Published	2024-07-26 13:55:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
