ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training

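Summary

Large language models (LLMs) have demonstrated remarkable performance on a wide range of natural language processing tasks.
However, training these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, a critical component of transformer-based LLMs.
This paper investigates the impact of faults on LLM training through systematic fault injection experiments, focusing on INF, NaN, and near-INF values in computation results.
We observe how these errors propagate: they can trigger non-trainable states in the model and disrupt training, forcing the procedure to reload from a checkpoint.
To mitigate the impact of these faults, we propose ATTNChecker, the first algorithm-based fault tolerance (ABFT) technique tailored to the attention mechanism in LLMs.
ATTNChecker is designed around the fault propagation patterns of LLMs and incorporates performance optimizations that adapt to both system reliability and model vulnerability, providing lightweight protection for fast LLM training.
Evaluations on four LLMs show that ATTNChecker incurs an average training overhead of 7% while detecting and correcting all extreme errors.
Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.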

Summary (original)

Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results with systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs. ATTNChecker is designed based on fault propagation patterns of LLM and incorporates performance optimization to adapt to both system reliability and model vulnerability while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs on average 7% overhead on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.
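For readers unfamiliar with ABFT, the following is a minimal sketch of the classic checksum idea applied to a single matrix product. It is not ATTNChecker's actual design, which the abstract does not detail. The helper names, the `NEAR_INF` threshold, and the restriction to one faulty value per column are all assumptions made for illustration.

```python
import math

NEAR_INF = 1e10  # assumed threshold for "near-INF"; not taken from the paper


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def is_extreme(x):
    """True for INF, NaN, or a magnitude above the assumed threshold."""
    return math.isnan(x) or math.isinf(x) or abs(x) > NEAR_INF


def correct_extreme(c_full):
    """Repair at most one extreme value per column of a checksummed product.

    `c_full` is the product of a checksum-extended A with B, so its last row
    should equal the column sums of the rows above it. The matrix is repaired
    in place; the (row, col) positions that were repaired are returned.
    """
    data, checksum = c_full[:-1], c_full[-1]
    fixed = []
    for j in range(len(checksum)):
        bad = [i for i in range(len(data)) if is_extreme(data[i][j])]
        if len(bad) == 1 and not is_extreme(checksum[j]):
            i = bad[0]
            others = sum(data[r][j] for r in range(len(data)) if r != i)
            data[i][j] = checksum[j] - others  # recover from the checksum
            fixed.append((i, j))
    return fixed


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(with_column_checksum(a), b)  # last row of c is the checksum row
c[0][1] = float("inf")                  # inject an INF fault into the result
fixed = correct_extreme(c)
```

In this sketch the checksum row costs one extra row in the product, and a corrupted entry is recomputed locally instead of reloading a checkpoint. That is the general trade-off the abstract describes, though the paper's own overhead figures come from its optimized implementation, not from code like this.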

arXiv information

Authors	Yuhang Liang, Xinyi Li, Jie Ren, Ang Li, Bo Fang, Jieyang Chen
Published	2025-01-29 18:49:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
