Gated Delta Networks: Improving Mamba2 with Delta Rule

Abstract

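Linear Transformers have drawn attention as an efficient alternative to standard Transformers, but their performance on retrieval and long-context tasks has been limited.
To address these limitations, recent work has explored two distinct mechanisms: gating, for adaptive memory control, and the delta update rule, for precise memory modification.
We find that these mechanisms are complementary: gating enables rapid memory erasure, while the delta rule facilitates targeted updates.
Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware.
Our proposed architecture, Gated DeltaNet, consistently outperforms existing models such as Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding.
We further improve performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.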

Abstract (original)

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
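The complementarity of gating and the delta rule can be illustrated with a minimal sketch of a single recurrent step. The abstract does not spell out the update, so the sketch assumes it takes the form S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, with outputs read as o_t = S_t q_t. The function names `gated_delta_step` and `read` are hypothetical, keys are assumed to be unit-norm, and this is the plain sequential form, not the hardware-optimized parallel training algorithm the paper develops.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """Return the new state after one step of the assumed gated delta rule.

    S     : d_v x d_k memory matrix (list of rows)
    k, v  : key (length d_k, assumed unit-norm) and value (length d_v)
    alpha : gate in [0, 1]; 0 wipes the old memory, 1 keeps it
    beta  : write strength in [0, 1]
    """
    new_S = []
    for i, row in enumerate(S):
        # Value component currently retrieved by key k: (S k)_i
        pred_i = sum(s * kj for s, kj in zip(row, k))
        new_S.append([
            # Decay, remove the old association for k, then write the new one
            alpha * (s - beta * pred_i * kj) + beta * v[i] * kj
            for s, kj in zip(row, k)
        ])
    return new_S


def read(S, q):
    """Query the memory: o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]


k1, k2 = [1.0, 0.0], [0.0, 1.0]
S = [[0.0, 0.0], [0.0, 0.0]]

S = gated_delta_step(S, k1, [1.0, 2.0], alpha=1.0, beta=1.0)  # store a value under k1
S = gated_delta_step(S, k1, [3.0, 4.0], alpha=1.0, beta=1.0)  # targeted overwrite of k1
overwritten = read(S, k1)

S = gated_delta_step(S, k2, [5.0, 6.0], alpha=0.0, beta=1.0)  # gate closed: erase, then write k2
after_erase = read(S, k1)
```

With alpha = 1 and beta = 1, the second write to `k1` replaces the stored value instead of adding to it, which is the targeted update attributed to the delta rule. With alpha = 0, the earlier association is cleared in a single step, which is the rapid erasure attributed to gating.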

arXiv information

Authors	Songlin Yang, Jan Kautz, Ali Hatamizadeh
Published	2024-12-09 13:09:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
