Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

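Abstract

The success of Transformer models has pushed the scale of deep learning models to billions of parameters.
However, the memory of a single GPU is limited, and best practice for choosing the optimal parallel strategy is still lacking, because doing so requires expertise in both deep learning and parallel computing.
The Colossal-AI system addresses this challenge by introducing a unified interface for scaling sequential model-training code to distributed environments.
It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with the zero redundancy optimizer.
Compared with the baseline system, Colossal-AI can achieve up to a 2.76 times training speedup on large-scale models.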

Abstract (original)

The success of Transformer models has pushed the deep learning model scale to billions of parameters. However, due to the limited memory resources of a single GPU, the best practice for choosing the optimal parallel strategy is still lacking, since it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addressed the above challenge by introducing a unified interface to scale your sequential code of model training to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with the zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
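
As a rough illustration of one of the parallel methods named in the abstract, the sketch below simulates tensor parallelism for a single linear layer: the weight matrix is split column-wise into shards, each shard is multiplied separately, and the partial outputs are concatenated. This is not Colossal-AI's API or implementation. It runs in one process on plain Python lists, and the helper names (`matmul`, `split_columns`, `tensor_parallel_linear`) are hypothetical, chosen only for this example.

```python
# Illustrative only: simulates tensor parallelism in a single process with
# plain Python lists. It does not use Colossal-AI or any real device.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def tensor_parallel_linear(x, w, parts):
    """Each simulated device multiplies x by its own shard of w."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating along the column axis stands in for an all-gather step.
    return [sum((out[r] for out in partial_outputs), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)                              # unsharded reference result
sharded = tensor_parallel_linear(x, w, parts=2)  # same layer, two shards
print(sharded == full)
```

Each shard holds only half of the weight columns, which is the point of the technique: no single device needs to store the full layer. A real system would also have to place the shards on separate devices and handle the communication, which this sketch omits.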

arXiv information

Authors	Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You
Published	2023-10-05 04:09:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
