Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

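Summary

Training of large language models (LLMs) is typically distributed across many accelerators to reduce training time.
Because internal states and parameter gradients must be exchanged at every gradient step, all devices need to be co-located and connected by low-latency, high-bandwidth links to carry the large volume of exchanged bits.
Recently, distributed algorithms such as DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", and synchronization between workers occurs only infrequently.
This means that workers can be connected by lower-bandwidth links without affecting learning quality.
However, in these methods communication between workers still requires the same peak bandwidth as before, because each synchronization exchanges all parameters across all workers.
This paper improves DiLoCo in three ways.
First, only subsets of the parameters are synchronized, in sequence rather than all at once, which greatly reduces peak bandwidth.
Second, workers continue training while synchronizing, which reduces wall-clock time.
Third, the data exchanged by workers is quantized, which further reduces the bandwidth needed between workers.
By properly combining these modifications, the authors show experimentally that training at billion-parameter scale can be distributed while reaching similar quality as before and reducing the required bandwidth by two orders of magnitude.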
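The first modification can be pictured with a small scheduling sketch. It is only an illustration of "synchronize subsets in sequence": the evenly staggered offsets, the function names, and the sizes used below are assumptions made here, not details taken from the paper.

```python
# Illustrative sketch only. The staggering rule and all names are assumptions,
# not the authors' implementation.

def fragment_sync_schedule(num_fragments, sync_period, total_steps):
    """Return {step: fragment_index} for steps at which a fragment is synchronized.

    Every fragment is synchronized once per `sync_period` steps. Fragment i is
    offset by i * stride steps so that no two fragments share a step.
    """
    stride = sync_period // num_fragments
    if stride == 0:
        raise ValueError("sync_period must be at least num_fragments")
    schedule = {}
    for step in range(1, total_steps + 1):
        for frag in range(num_fragments):
            if (step - frag * stride) % sync_period == 0:
                schedule[step] = frag
    return schedule


def peak_params_per_sync(fragment_sizes, streaming):
    """Parameters sent in the largest single synchronization event."""
    # Full synchronization sends every fragment at once; streaming sends one.
    return max(fragment_sizes) if streaming else sum(fragment_sizes)


schedule = fragment_sync_schedule(num_fragments=4, sync_period=100, total_steps=200)
sizes = [250, 250, 250, 250]  # hypothetical fragment sizes, in parameters
full_peak = peak_params_per_sync(sizes, streaming=False)
streaming_peak = peak_params_per_sync(sizes, streaming=True)
```

With four equal fragments, the largest single exchange shrinks by a factor of four, while each fragment is still synchronized once per period. The total volume exchanged over a period is unchanged; only the peak drops.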

Abstract (original)

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into “workers”, where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.
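The third modification, quantizing the exchanged data, can be illustrated with a generic round trip. The abstract does not say which number format is used, so the uniform 4-bit integer scheme below is an assumption chosen for simplicity, and the function names are hypothetical.

```python
# Illustrative sketch only. Uniform integer quantization is a stand-in
# technique assumed here; the abstract does not specify the format used.

def quantize(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1], plus an offset and scale."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a zero range so the division below is always defined.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]


values = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical values to be exchanged
codes, lo, scale = quantize(values, bits=4)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(values, restored))
compression_ratio = 32 / 4  # float32 payload versus 4-bit codes
```

Each value is sent as a 4-bit code instead of a 32-bit float, at the cost of a reconstruction error bounded by half the quantization step. The offset and scale add a small fixed overhead per message.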

arXiv information

Authors	Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc’Aurelio Ranzato, Paul Barham
Published	2025-01-30 17:23:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
