Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

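Summary

We present Pangu Ultra, a large language model (LLM) with 135 billion parameters and dense Transformer modules, trained on Ascend Neural Processing Units (NPUs).
Although the LLM field has seen unprecedented advances in scale and capability in recent years, training a model this large still poses significant optimization and system challenges.
To stabilize training, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training of deep models.
We pre-train the model on 13.2 trillion diverse, high-quality tokens and further enhance its reasoning capabilities during post-training.
To run training at this scale efficiently, we use 8,192 Ascend NPUs with a series of system optimizations.
Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state of the art among dense LLMs such as Llama 405B and Mistral Large 2, and even achieves results competitive with DeepSeek-R1, whose sparse model structure contains many more parameters.
Our exploration demonstrates that Ascend NPUs can efficiently and effectively train dense models with more than 100 billion parameters.
Our model and system will be available to our commercial customers.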

Abstract (original)

We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLM has been witnessing unprecedented advances in pushing the scale and capability of LLM in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training process of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains much more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.
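
To put the abstract's headline numbers in perspective, here is a rough back-of-envelope sketch. The parameter count, token count, and NPU count come from the abstract; the `6 * N * D` compute estimate is a common rule of thumb for dense Transformers and is an assumption here, not a figure reported by the authors. The function names are hypothetical.

```python
# Scale figures quoted in the abstract.
N_PARAMS = 135e9     # 135 billion parameters
N_TOKENS = 13.2e12   # 13.2 trillion pre-training tokens
N_NPUS = 8192        # Ascend NPUs used for training


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Ratio of pre-training tokens to model parameters."""
    return tokens / params


def approx_training_flops(tokens: float, params: float) -> float:
    """Rough pre-training compute.

    ASSUMPTION: uses the generic 6 * N * D approximation for dense
    Transformers; the source does not state a FLOP count.
    """
    return 6.0 * params * tokens


tpp = tokens_per_parameter(N_TOKENS, N_PARAMS)
flops = approx_training_flops(N_TOKENS, N_PARAMS)
flops_per_npu = flops / N_NPUS

print(f"tokens per parameter: {tpp:.1f}")
print(f"approx. training FLOPs: {flops:.3e}")
print(f"approx. FLOPs per NPU: {flops_per_npu:.3e}")
```

These are order-of-magnitude estimates only; they ignore post-training, hardware utilization, and any recomputation overhead.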

arXiv information

Authors	Yichun Yin,Wenyong Huang,Kaikai Song,Yehui Tang,Xueyu Wu,Wei Guo,Peng Guo,Yaoyuan Wang,Xiaojun Meng,Yasheng Wang,Dong Li,Can Chen,Dandan Tu,Yin Li,Fisher Yu,Ruiming Tang,Yunhe Wang,Baojun Wang,Bin Wang,Bo Wang,Boxiao Liu,Changzheng Zhang,Duyu Tang,Fei Mi,Hui Jin,Jiansheng Wei,Jiarui Qin,Jinpeng Li,Jun Zhao,Liqun Deng,Lin Li,Minghui Xu,Naifu Zhang,Nianzu Zheng,Qiang Li,Rongju Ruan,Shengjun Cheng,Tianyu Guo,Wei He,Wei Li,Weiwen Liu,Wulong Liu,Xinyi Dai,Yonghan Dong,Yu Pan,Yue Li,Yufei Wang,Yujun Li,Yunsheng Ni,Zhe Liu,Zhenhe Zhang,Zhicheng Liu
Published	2025-04-11 07:47:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
