StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Abstract

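Training language models on long sequences is necessary for improving their ability on complex tasks such as long-chain reasoning. However, as the sequence length grows, the memory needed to store activation values during backpropagation (BP) becomes enormous, even when gradient checkpointing is applied. To address this, we propose StreamBP, a memory-efficient and exact BP method. It performs a linear decomposition of the chain rule along the sequence dimension, layer by layer, which significantly reduces the memory cost of activation values and logits. The method applies to common objectives such as SFT, GRPO, and DPO. On the implementation side, StreamBP exploits the causal structure of the language model to use fewer FLOPs and run BP faster. Compared with gradient checkpointing, StreamBP increases the maximum sequence length of BP by a factor of 2.8 to 5.5 while taking comparable or less BP time. This sequence-length scaling carries over directly to batch-size scaling, which can speed up training. We also develop a communication-efficient distributed StreamBP to support multi-GPU training and broaden its applicability. Our code integrates easily into the training pipeline of any transformer model and is available at https://github.com/Ledzy/StreamBP.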

Abstract (original)

Training language models on long sequence data is a demanding requirement for enhancing the model’s capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP achieves less computational FLOPs and faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times larger, while using comparable or even less BP time. Note that StreamBP’s sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at https://github.com/Ledzy/StreamBP.
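
The abstract's central idea, that the chain rule decomposes linearly along the sequence dimension, can be illustrated on the simplest piece of a language model: the output head. The sketch below is not the authors' implementation. It assumes a hypothetical toy setup (a single linear head `W`, summed cross-entropy loss, small random data, and helper names such as `streamed_loss_and_grad` that are invented for this example) and only shows that accumulating per-chunk gradients reproduces the full gradient while materializing logits for just a few positions at a time. Transformer layers with causal attention need additional care that this sketch does not cover.

```python
import math
import random

def logits_for(h_rows, W):
    """Logits for a block of positions: one row of V scores per hidden state."""
    return [[sum(h[k] * W[k][v] for k in range(len(W))) for v in range(len(W[0]))]
            for h in h_rows]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def head_loss_and_grad(h_rows, targets, W):
    """Summed cross-entropy over the given positions and its gradient w.r.t. W."""
    d, V = len(W), len(W[0])
    grad = [[0.0] * V for _ in range(d)]
    loss = 0.0
    for h, z, y in zip(h_rows, logits_for(h_rows, W), targets):
        p = softmax(z)
        loss -= math.log(p[y])
        for v in range(V):
            err = p[v] - (1.0 if v == y else 0.0)  # dLoss/dlogit for this position
            for k in range(d):
                grad[k][v] += h[k] * err
    return loss, grad

def streamed_loss_and_grad(H, targets, W, chunk):
    """Same quantities, but logits exist for only `chunk` positions at a time."""
    d, V = len(W), len(W[0])
    total_grad = [[0.0] * V for _ in range(d)]
    total_loss = 0.0
    for start in range(0, len(H), chunk):
        loss, grad = head_loss_and_grad(H[start:start + chunk],
                                        targets[start:start + chunk], W)
        total_loss += loss
        for k in range(d):
            for v in range(V):
                total_grad[k][v] += grad[k][v]
    return total_loss, total_grad

# Hypothetical toy sizes: 12 positions, hidden size 4, vocabulary of 7.
random.seed(0)
T, d, V = 12, 4, 7
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
W = [[random.gauss(0, 1) for _ in range(V)] for _ in range(d)]
targets = [random.randrange(V) for _ in range(T)]

full_loss, full_grad = head_loss_and_grad(H, targets, W)
chunk_loss, chunk_grad = streamed_loss_and_grad(H, targets, W, chunk=5)
max_diff = max(abs(full_grad[k][v] - chunk_grad[k][v])
               for k in range(d) for v in range(V))
print("loss difference:", abs(full_loss - chunk_loss))
print("max gradient difference:", max_diff)
```

Because the loss is a sum over positions, the two results agree up to floating-point rounding, which is why a streamed computation of this kind can be exact rather than approximate. The chunk size only trades peak memory against the number of passes.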

arXiv information

Authors	Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
Published	2025-06-03 16:54:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
