ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

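Abstract

Deep learning models and their input data are scaling at an unprecedented rate, so moving to distributed training platforms is unavoidable, both to fit the model and to raise training throughput.
State-of-the-art approaches and techniques such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies are being actively adopted in emerging distributed training systems.
The result is a complex SW/HW co-design stack for distributed training, which calls for a modeling/simulation infrastructure for design-space exploration.
In this paper, we extend the open-source ASTRA-sim infrastructure so that it can model state-of-the-art and emerging distributed training models and platforms.
More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates, which makes it possible to simulate target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems.
With these capabilities, we run comprehensive case studies targeting emerging distributed models and platforms.
The infrastructure lets system designers swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.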

Abstract (original)

As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates enabling simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and give meaningful insights when designing and deploying distributed training platforms at scale.

arXiv information

Authors	William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
Published	2023-03-24 14:00:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
