Parallelizing Linear Transformers with the Delta Rule over Sequence Length

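Summary

Transformers with linear attention (linear transformers) and state-space models have recently been proposed as viable linear-time alternatives to transformers with softmax attention.
However, these models still underperform transformers, especially on tasks that require in-context retrieval.
More expressive variants of linear transformers, which replace the additive update with the delta rule (DeltaNet), have proven more effective at associative recall, but existing algorithms for training such models do not parallelize over sequence length and are therefore inefficient to train on modern hardware.
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices.
This algorithm makes it possible to scale DeltaNet up to standard language modeling settings.
A 1.3B-parameter model trained on 100B tokens outperforms recent linear-time baselines such as Mamba and GLA in both perplexity and zero-shot performance on downstream tasks.
The authors also experiment with two hybrid models that combine DeltaNet layers with either (1) sliding-window attention layers in every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.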

Abstract (original)

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models which combine DeltaNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.
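To make the contrast between the additive update and the delta rule concrete, the following is a minimal sequential sketch in plain Python. It is not the paper's parallel training algorithm; it only illustrates the per-token recurrence for a single head. All function names here (additive_step, delta_step, delta_step_matrix_form) are hypothetical, and the writing strength beta is assumed to be a scalar supplied by the caller. The last function writes the same delta update as the state multiplied by I - beta k k^T, which is the Householder-like factor that the abstract's "products of Householder matrices" refers to.

```python
# Illustrative sketch only: sequential state updates for one attention head.
# S is a (d_v x d_k) state matrix stored as a list of rows.

def outer(u, w):
    """Outer product u w^T as a list of rows."""
    return [[ui * wj for wj in w] for ui in u]

def matvec(S, x):
    """Matrix-vector product S x (reads the value stored under key x)."""
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in S]

def additive_step(S, k, v):
    """Linear attention update: S_t = S_{t-1} + v_t k_t^T."""
    upd = outer(v, k)
    return [[s + u for s, u in zip(rs, ru)] for rs, ru in zip(S, upd)]

def delta_step(S, k, v, beta):
    """Delta rule update: S_t = S_{t-1} - beta * (S_{t-1} k_t - v_t) k_t^T."""
    pred = matvec(S, k)                       # value currently retrieved by k
    err = [p - vi for p, vi in zip(pred, v)]  # retrieval error
    upd = outer(err, k)
    return [[s - beta * u for s, u in zip(rs, ru)] for rs, ru in zip(S, upd)]

def delta_step_matrix_form(S, k, v, beta):
    """Same update written as S_{t-1} (I - beta k k^T) + beta v k^T."""
    d = len(k)
    H = [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
         for i in range(d)]
    SH = [[sum(row[m] * H[m][j] for m in range(d)) for j in range(d)]
          for row in S]
    write = outer(v, k)
    return [[a + beta * w for a, w in zip(ra, rw)] for ra, rw in zip(SH, write)]

# Write two different values under the same key and read the key back.
k = [1.0, 0.0]
v_old, v_new = [1.0, 2.0], [3.0, 4.0]

S_add = [[0.0, 0.0], [0.0, 0.0]]
S_delta = [[0.0, 0.0], [0.0, 0.0]]
for v in (v_old, v_new):
    S_add = additive_step(S_add, k, v)
    S_delta = delta_step(S_delta, k, v, beta=1.0)

print(matvec(S_add, k))    # [4.0, 6.0]: the two values are summed
print(matvec(S_delta, k))  # [3.0, 4.0]: the old value is overwritten
```

In this toy case the additive update accumulates both values under the key, while the delta rule with beta = 1 replaces the old association with the new one, which is one intuition for why it helps associative recall.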

arXiv information

Authors	Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim
Published	2024-11-05 16:48:53+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
