Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Abstract

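Large language models (LLMs) have made significant progress on complex tasks, but their widespread adoption is held back by substantial computational demands.
Transformer-based LLMs have hundreds of billions of parameters and therefore require months of pretraining on high-end GPU clusters.
This paper, however, reports a compelling finding: transformers exhibit considerable redundancy in pretraining computation. This motivates the proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can cut about $75\%$ of floating point operations (FLOPs) while maintaining performance.
MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, and comprises three distinct phases: warm-up, ultra-sparsification, and restoration.
The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections.
Throughout these phases, the model is trained with a dynamically evolving sparse topology and the HSA mechanism, maintaining performance while minimizing training FLOPs.
Experiments on GPT-2 show a $4\times$ FLOP reduction without compromising performance.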
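
The abstract above gives two figures (about $75\%$ of FLOPs removed, and a $4\times$ reduction) and names three phases without specifying their schedule. The Python sketch below shows how the two figures relate, and illustrates one possible shape for a three-phase sparsity schedule. Everything in the schedule is an assumption made for illustration: the function names, the linear ramps, the step counts, and the `peak_sparsity` and `final_sparsity` defaults are hypothetical and are not taken from the paper.

```python
def flop_reduction_factor(fraction_removed: float) -> float:
    """Convert a fractional FLOP saving into a multiplicative reduction factor."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


def sparsity_at_step(
    step: int,
    total_steps: int,
    warmup_steps: int,
    restore_steps: int,
    peak_sparsity: float = 0.9,   # hypothetical value, not from the paper
    final_sparsity: float = 0.5,  # hypothetical value, not from the paper
) -> float:
    """Illustrative three-phase schedule: warm-up, ultra-sparsification, restoration.

    Linear ramps are an assumption; the abstract does not state the schedule.
    """
    if warmup_steps + restore_steps > total_steps:
        raise ValueError("warm-up and restoration must fit inside total_steps")
    restore_start = total_steps - restore_steps
    if step < warmup_steps:
        # Warm-up: move from a dense model (sparsity 0) toward the peak sparsity.
        return peak_sparsity * step / warmup_steps
    if step < restore_start:
        # Ultra-sparsification: hold the highest sparsity level.
        return peak_sparsity
    # Restoration: reinstate connections, lowering sparsity toward its final value.
    progress = (step - restore_start) / max(1, restore_steps)
    return peak_sparsity + (final_sparsity - peak_sparsity) * progress


if __name__ == "__main__":
    # Removing 75% of the FLOPs leaves 25%, which is a 4x reduction.
    print(flop_reduction_factor(0.75))
    for s in (0, 10, 50, 90, 100):
        print(s, round(sparsity_at_step(s, 100, 20, 20), 3))
```

The first function only restates the arithmetic behind the headline numbers: removing a fraction $p$ of the FLOPs leaves $1 - p$, so the reduction factor is $1/(1 - p)$. The second function does not estimate FLOPs and should not be read as the authors' schedule.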

Abstract (original)

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billion parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about $75\%$ of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.

arXiv information

Authors	Pihe Hu, Shaolong Li, Longbo Huang
Published	2024-08-21 16:13:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
