Breadth-First Pipeline Parallelism

Summary

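This paper introduces Breadth-First Pipeline Parallelism, a new training schedule that optimizes the combination of pipeline parallelism and data parallelism.
Breadth-First Pipeline Parallelism reduces training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism.
In experiments with a small batch size per GPU, training throughput for a 52-billion-parameter model increased by up to 43% compared to Megatron-LM, which would reduce training time and cost by the same amount on a large GPU cluster.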

Abstract (original)

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
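As a back-of-the-envelope aid for reading the throughput figure, the sketch below converts a relative throughput gain into a relative wall-clock saving. It assumes that the total amount of work (for example, the number of training tokens) is fixed and that time equals work divided by throughput; neither assumption comes from the abstract, and the function name `time_reduction` is hypothetical.

```python
def time_reduction(throughput_gain: float) -> float:
    """Fractional drop in wall-clock time for a fixed amount of work.

    Assumes time = work / throughput, so a relative gain g in throughput
    scales the time by 1 / (1 + g). This is an illustrative assumption,
    not a formula taken from the paper.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    gain = 0.43  # the "up to 43%" figure quoted in the abstract
    print(f"throughput +{gain:.0%} -> time -{time_reduction(gain):.1%}")
```

Under these assumptions, a 43% throughput gain corresponds to roughly 30% less wall-clock time for the same workload, so "the same amount" in the abstract is perhaps best read as a reduction by a factor of 1.43 rather than a 43% cut in hours.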

arXiv information

Author	Joel Lamy-Poirier
Published	2023-07-06 19:03:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
