Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

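Abstract

The success of Transformer models has pushed the scale of deep learning models to billions of parameters.
Because the memory of a single GPU is limited, such models have to be trained across multiple devices, yet best practices for choosing an optimal parallel strategy are still lacking, since doing so requires expertise in both deep learning and parallel computing.
The Colossal-AI system addresses this challenge by introducing a unified interface that scales sequential model-training code to distributed environments.
It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with the zero redundancy optimizer.
Compared to the baseline system, Colossal-AI can achieve up to a 2.76x training speedup on large-scale models.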

Abstract (original)

The success of Transformer models has pushed the deep learning model scale to billions of parameters. Due to the limited memory resource of a single GPU, However, the best practice for choosing the optimal parallel strategy is still lacking, since it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addressed the above challenge by introducing a unified interface to scale your sequential code of model training to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
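
To make the difference between two of the parallelism styles named above more concrete, here is a minimal conceptual sketch in plain Python. It does not use Colossal-AI and does not reflect its API; the "workers" are simulated with lists, and every helper name (`matmul`, `split_columns`, `data_parallel`, `tensor_parallel`) is hypothetical. It assumes the batch size and the number of output columns divide evenly by the worker count.

```python
# Conceptual sketch only: "devices" are simulated with plain Python lists.
# None of these helper names come from Colossal-AI.

def matmul(x, w):
    """Multiply a (batch x in) matrix by an (in x out) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split the columns of w into `parts` equal contiguous shards."""
    step = len(w[0]) // parts  # assumes the column count is divisible by parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def data_parallel(x, w, workers):
    """Each worker holds the full weight and processes a slice of the batch."""
    step = len(x) // workers  # assumes the batch size is divisible by workers
    out = []
    for i in range(workers):
        out.extend(matmul(x[i * step:(i + 1) * step], w))
    return out

def tensor_parallel(x, w, workers):
    """Each worker holds a column shard of the weight and sees the full batch."""
    partials = [matmul(x, shard) for shard in split_columns(w, workers)]
    # Concatenate per-worker output columns (an all-gather in a real system).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # batch of 4, 2 features
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]       # 2 inputs, 4 outputs
reference = matmul(x, w)

# Both strategies reproduce the single-device result.
assert data_parallel(x, w, 2) == reference
assert tensor_parallel(x, w, 2) == reference
```

In the data-parallel case every worker needs memory for the whole weight matrix, while in the tensor-parallel case each worker stores only its shard, which is why the latter helps when a model no longer fits on one GPU.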

arXiv information

Authors	Shenggui Li, Jiarui Fang, Zhengda Bian, Hongxin Liu, Yuliang Liu, Haichen Huang, Boxiang Wang, Yang You
Published	2022-09-20 12:54:20+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
