Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs

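Abstract

This technical report addresses the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming the cost inefficiency and resource limitations prevalent in such systems.
To address these issues, we present two MoE large language models (LLMs) of different sizes, Ling-Lite and Ling-Plus (called "Bailing" in Chinese, written Bǎilíng in Pinyin).
Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus has 290 billion parameters with 28.8 billion activated parameters.
Both models perform comparably to leading industry benchmarks.
This report offers actionable insights for improving the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies.
Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimizing the model architecture and training processes, (2) refining the handling of training anomalies, and (3) improving the efficiency of model evaluation.
In addition, by leveraging high-quality data generated from knowledge graphs, our models demonstrate superior tool-use capabilities compared to other models.
Ultimately, our experimental findings show that a 300B MoE LLM can be trained effectively on lower-performance devices while achieving performance comparable to models of similar scale, including both dense and MoE models.
Compared to high-performance devices, using a lower-specification hardware system during the pre-training phase yields significant cost savings, reducing computing costs by approximately 20%.
The models can be accessed at https://huggingface.co/inclusionAI.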

Abstract (original)

In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as ‘Bailing’ in Chinese, spelled Bǎilíng in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at https://huggingface.co/inclusionAI.
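
The parameter counts quoted in the abstract imply that only part of each model's weights is activated at a time. As a minimal sketch of that arithmetic (the helper name `activated_fraction` is hypothetical and not from the report; the numbers are the ones stated in the abstract, in billions):

```python
def activated_fraction(total_b: float, activated_b: float) -> float:
    """Return the share of parameters that are activated (hypothetical helper)."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return activated_b / total_b


# (total, activated) parameter counts in billions, as quoted in the abstract.
models = {
    "Ling-Lite": (16.8, 2.75),
    "Ling-Plus": (290.0, 28.8),
}

for name, (total_b, activated_b) in models.items():
    share = activated_fraction(total_b, activated_b)
    print(f"{name}: {activated_b}B of {total_b}B parameters activated ({share:.1%})")
```

Under these figures, roughly 16% of Ling-Lite's parameters and roughly 10% of Ling-Plus's parameters are activated. This is only a ratio of the quoted counts and says nothing about the routing scheme or the measured training cost.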

arXiv information

Authors	Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, Fakang Wang, Gangshan Wang, Guangyao Zhai, Haitao Zhang, Huizhong Li, Jun Zhou, Jia Liu, Junpeng Fang, Junjie Ou, Jun Hu, Ji Luo, Ji Zhang, Jian Liu, Jian Sha, Jianxue Qian, Jiewei Wu, Junping Zhao, Jianguo Li, Jubao Feng, Jingchao Di, Junming Xu, Jinghua Yao, Kuan Xu, Kewei Du, Longfei Li, Lei Liang, Lu Yu, Li Tang, Lin Ju, Peng Xu, Qing Cui, Song Liu, Shicheng Li, Shun Song, Song Yan, Tengwei Cai, Tianyi Chen, Ting Guo, Ting Huang, Tao Feng, Tao Wu, Wei Wu, Xiaolu Zhang, Xueming Yang, Xin Zhao, Xiaobo Hu, Xin Lin, Yao Zhao, Yilong Wang, Yongzhen Guo, Yuanyuan Wang, Yue Yang, Yang Cao, Yuhao Fu, Yi Xiong, Yanzhe Li, Zhe Li, Zhiqiang Zhang, Ziqi Liu, Zhaoxin Huan, Zujie Wen, Zhenhang Sun, Zhuoxuan Du, Zhengyu He
Published	2025-03-10 14:21:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
