Exploiting Transformer Activation Sparsity with Dynamic Inference

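Summary

Despite their strong performance, Transformer models often face practical limitations because of their high computational requirements.
At the same time, previous studies have revealed significant activation sparsity in these models, which indicates the presence of redundant computation.
This paper proposes Dynamic Sparsified Transformer Inference (DSTI), a method that substantially reduces the inference cost of Transformer models by enforcing activation sparsity and then converting the dense model into its sparse Mixture of Experts (MoE) version.
The authors demonstrate that small gating networks can be trained to successfully predict the relative contribution of each expert during inference.
They also introduce a mechanism that dynamically determines, for each token individually, how many experts are executed.
DSTI can be applied to any Transformer-based architecture and has a negligible impact on accuracy.
For the BERT-base classification model, inference cost is reduced by almost 60%.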

Summary (original)

Transformer models, despite their impressive performance, often face practical limitations due to their high computational requirements. At the same time, previous studies have revealed significant activation sparsity in these models, indicating the presence of redundant computations. In this paper, we propose Dynamic Sparsified Transformer Inference (DSTI), a method that radically reduces the inference cost of Transformer models by enforcing activation sparsity and subsequently transforming a dense model into its sparse Mixture of Experts (MoE) version. We demonstrate that it is possible to train small gating networks that successfully predict the relative contribution of each expert during inference. Furthermore, we introduce a mechanism that dynamically determines the number of executed experts individually for each token. DSTI can be applied to any Transformer-based architecture and has negligible impact on the accuracy. For the BERT-base classification model, we reduce inference cost by almost 60%.
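
The abstract mentions a mechanism that chooses the number of executed experts separately for each token. The following is a minimal sketch of one way such a rule could look, not the authors' implementation: it assumes the gating network outputs one non-negative score per expert and keeps every expert whose score is at least a fraction `tau` of the largest score. The function name `select_experts`, the threshold `tau`, and the example scores are all hypothetical.

```python
def select_experts(predicted_scores, tau=0.5):
    """Return the indices of the experts to run for a single token.

    predicted_scores: hypothetical gating outputs, one non-negative
        score per expert, estimating each expert's relative contribution.
    tau: assumed relative threshold in [0, 1]; higher values run fewer experts.
    """
    top = max(predicted_scores)
    if top <= 0:
        # No expert is predicted to contribute, so none is executed.
        return []
    # Keep every expert whose score is within a factor tau of the best one.
    return [i for i, s in enumerate(predicted_scores) if s >= tau * top]


# Two tokens with made-up gating outputs over four experts.
peaked_token = [0.9, 0.1, 0.05, 0.02]  # one expert dominates
flat_token = [0.5, 0.45, 0.4, 0.1]     # several experts matter

print(select_experts(peaked_token))  # [0]
print(select_experts(flat_token))    # [0, 1, 2]
```

With a relative threshold like this, a token whose predicted contributions are concentrated in one expert triggers less computation than a token whose contributions are spread out, which is the kind of per-token adaptivity the abstract describes.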

arXiv information

Authors	Mikołaj Piórczyński, Filip Szatkowski, Klaudia Bałazy, Bartosz Wójcik
Published	2023-10-06 16:34:51+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
