Masked Structural Growth for 2x Faster Language Model Pre-training

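Abstract

Accelerating the pre-training of large language models is an important problem in current research.
This paper focuses on accelerating pre-training by progressively growing a small Transformer structure into a large one.
Progressive growth raises two main research questions: determining the optimal growth schedule and designing efficient growth operators.
Regarding growth schedules, existing work has not sufficiently investigated how each individual dimension affects a schedule's efficiency.
Regarding growth operators, existing methods rely on the initialization of new weights to inherit knowledge and achieve only non-strict function preservation, which limits further improvements to training dynamics.
To address these issues, we propose Masked Structural Growth (MSG), comprising (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights.
Experiments show that MSG is significantly faster than related work: it achieves up to a 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance.
The code is publicly available at https://github.com/cofe-ai/MSG.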

Abstract (original)

Accelerating large language model pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule’s efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
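
To make the idea of a strictly function-preserving growth operator concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked above for that); it only illustrates, on a toy two-layer ReLU network, how a mask can make the output independent of how new weights are initialized. The function names (`mlp_forward`, `grow_hidden`), the toy dimensions, and the choice of growing only the hidden width are all assumptions made for illustration.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp_forward(W1, W2, x, mask):
    # Hidden activations are multiplied elementwise by the mask
    # before the output projection.
    h = relu(matvec(W1, x))
    h = [m * hi for m, hi in zip(mask, h)]
    return matvec(W2, h)

def grow_hidden(W1, W2, mask, extra, rng):
    # Hypothetical growth operator: add `extra` hidden units whose weights
    # are initialized arbitrarily (here: random Gaussian values).
    # New units receive mask 0, so they contribute nothing to the output yet.
    d_in = len(W1[0])
    new_W1 = [row[:] for row in W1] + [
        [rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(extra)
    ]
    new_W2 = [row[:] + [rng.gauss(0.0, 1.0) for _ in range(extra)] for row in W2]
    new_mask = list(mask) + [0.0] * extra
    return new_W1, new_W2, new_mask

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 3, 2  # toy sizes, chosen only for illustration
W1 = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.gauss(0.0, 1.0) for _ in range(d_hidden)] for _ in range(d_out)]
mask = [1.0] * d_hidden
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

y_small = mlp_forward(W1, W2, x, mask)
gW1, gW2, gmask = grow_hidden(W1, W2, mask, extra=5, rng=rng)
y_grown = mlp_forward(gW1, gW2, x, gmask)

# The grown network computes the same function as the small one,
# regardless of the values drawn for the new weights.
print(all(abs(a - b) < 1e-12 for a, b in zip(y_small, y_grown)))
```

In this sketch the new units are switched off at the moment of growth, so preservation does not depend on any special initialization. A method in this style would then presumably raise the mask entries for the new units toward 1 during subsequent training so that they start to contribute; the abstract does not specify that schedule, so it is left out here.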

arXiv information

Authors	Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang
Published	2024-03-08 08:54:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
