SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

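Summary

The self-attention layers of modern Transformers are costly: their memory and compute requirements grow quadratically with sequence length.
Existing approximation methods usually underperform and fail to deliver significant speedups in practice.
This paper presents SwitchHead, a new method that reduces both compute and memory requirements and achieves wall-clock speedup while matching the language modeling performance of baseline Transformers with the same parameter budget.
SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers.
The new attention mechanism can also be combined with MoE MLP layers, yielding an efficient, fully MoE "SwitchHead" Transformer model.
The code is public.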

Abstract (original)

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead – a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE ‘SwitchHead’ Transformer model. Our code is public.
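
To make the quadratic cost and the "4 to 8 times fewer attention matrices" claim concrete, here is a rough back-of-the-envelope sketch. It counts only the storage for the attention matrices of a single layer and a single sequence. The head count, sequence length, and 4-byte values are hypothetical numbers chosen for illustration, not figures from the paper, and the sketch ignores projection weights, MoE routing overhead, and kernels that never materialize the full matrix.

```python
def attention_matrix_bytes(seq_len, n_matrices, bytes_per_value=4):
    """Bytes needed to hold n_matrices attention matrices of shape (seq_len, seq_len)."""
    return n_matrices * seq_len * seq_len * bytes_per_value


# Hypothetical configuration, chosen only for illustration (not from the paper).
BASELINE_HEADS = 16
SEQ_LEN = 1024

baseline = attention_matrix_bytes(SEQ_LEN, BASELINE_HEADS)

# The abstract states that 4 to 8 times fewer attention matrices are required.
for reduction in (4, 8):
    reduced = attention_matrix_bytes(SEQ_LEN, BASELINE_HEADS // reduction)
    print(f"{reduction}x fewer matrices: "
          f"{baseline / 2**20:.0f} MiB -> {reduced / 2**20:.0f} MiB")

# Quadratic scaling: doubling the sequence length quadruples the cost per matrix.
ratio = attention_matrix_bytes(2 * SEQ_LEN, 1) / attention_matrix_bytes(SEQ_LEN, 1)
print(f"2x sequence length -> {ratio:.0f}x memory per attention matrix")
```

Under these assumptions, the number of attention matrices scales the cost linearly, while sequence length scales it quadratically, which is why cutting the matrix count matters most for long sequences.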

arXiv information

Authors	Róbert Csordás, Piotr Piękos, Kazuki Irie
Published	2023-12-13 09:00:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
