Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Summary

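We present Lightning Attention, the first linear attention implementation that maintains a constant training speed across sequence lengths under fixed memory consumption.
Because of the cumulative summation (cumsum) operation, previous linear attention implementations could not realize their theoretical advantage in the causal setting.
This issue can be solved effectively by using different attention computation strategies for different parts of the attention.
Specifically, the attention computation is split into intra-block and inter-block parts: conventional attention computation is used within blocks, and the linear attention kernel trick is used across blocks.
This eliminates the need for cumsum in the linear attention computation.
In addition, a tiling technique is adopted in both the forward and backward passes to take full advantage of GPU hardware.
To improve accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture tailored to our Lightning Attention.
We conduct rigorous tests on standard and self-collected datasets with varying model sizes and sequence lengths.
TNL is notably more efficient than other language models.
Benchmark results further indicate that TNL performs on par with state-of-the-art LLMs that use conventional transformer structures.
The source code is released at github.com/OpenNLPLab/TransnormerLLM.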
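
The intra-block / inter-block split described in the summary can be illustrated with a small NumPy sketch. This is not the authors' GPU implementation: it assumes unnormalized causal linear attention of the form O = ((Q K^T) * M) V with a lower-triangular mask M, it omits any decay or normalization terms, and the function names and block size are hypothetical choices made for illustration only.

```python
import numpy as np


def causal_linear_attention_reference(q, k, v):
    """Quadratic reference (assumed form): O = ((Q K^T) * causal_mask) V."""
    n = q.shape[0]
    mask = np.tril(np.ones((n, n)))  # position i may attend to j <= i
    return ((q @ k.T) * mask) @ v


def causal_linear_attention_blockwise(q, k, v, block_size):
    """Blockwise evaluation of the same quantity (illustrative sketch).

    Intra-block: conventional masked attention inside each block.
    Inter-block: kernel trick, Q_block @ (sum of K^T V over earlier blocks).
    No per-token cumulative sum is needed; only one running (d x d_v)
    matrix is updated once per block.
    """
    n, d = q.shape
    d_v = v.shape[1]
    out = np.zeros((n, d_v))
    kv = np.zeros((d, d_v))  # accumulated K^T V of all previous blocks

    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]

        # Intra-block part: left product with a causal mask.
        size = end - start
        intra = ((qb @ kb.T) * np.tril(np.ones((size, size)))) @ vb

        # Inter-block part: right product against the running state.
        inter = qb @ kv

        out[start:end] = intra + inter

        # Fold this block into the state used by later blocks.
        kv += kb.T @ vb

    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((10, 4))
    k = rng.standard_normal((10, 4))
    v = rng.standard_normal((10, 3))
    ref = causal_linear_attention_reference(q, k, v)
    blk = causal_linear_attention_blockwise(q, k, v, block_size=4)
    print("max abs difference:", np.abs(ref - blk).max())
```

In this sketch the cost of each block is independent of how many blocks came before it, because earlier context is carried only by the fixed-size `kv` matrix. That is the property the summary relies on for constant speed across sequence lengths.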

Summary (original)

We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a causal setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.

arXiv information

Authors	Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong
Published	2024-05-27 17:38:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
