Attention Is Not All You Need Anymore

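Abstract

In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision.
Many existing works aim to reduce the computational and memory complexity of the Transformer's self-attention mechanism at the cost of performance.
However, performance is key to the Transformer's continued success.
This paper proposes a family of drop-in replacements for the self-attention mechanism in the Transformer, called the Extractors.
Four types of Extractor are proposed as examples: the super high-performance Extractor (SHE), the higher-performance Extractor (HE), the worthwhile Extractor (WE), and the minimalist Extractor (ME).
Experimental results show that replacing the self-attention mechanism with the SHE evidently improves the performance of the Transformer, whereas the simplified versions of the SHE (the HE, the WE, and the ME) perform close to or better than the self-attention mechanism with less computational and memory complexity.
Furthermore, the proposed Extractors can, or have the potential to, run faster than the self-attention mechanism because their critical paths of computation are much shorter.
Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.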

Abstract (original)

In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer by trading off performance. However, performance is key for the continuing success of the Transformer. In this paper, a family of drop-in replacements for the self-attention mechanism in the Transformer, called the Extractors, is proposed. Four types of the Extractors, namely the super high-performance Extractor (SHE), the higher-performance Extractor (HE), the worthwhile Extractor (WE), and the minimalist Extractor (ME), are proposed as examples. Experimental results show that replacing the self-attention mechanism with the SHE evidently improves the performance of the Transformer, whereas the simplified versions of the SHE, i.e., the HE, the WE, and the ME, perform close to or better than the self-attention mechanism with less computational and memory complexity. Furthermore, the proposed Extractors have the potential or are able to run faster than the self-attention mechanism since their critical paths of computation are much shorter. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.

arXiv information

Author	Zhe Chen
Published	2023-09-19 13:32:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
