SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

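Abstract

Modern speech processing systems rely on self-attention.
Unfortunately, mixing tokens with self-attention takes time quadratic in the length of the speech utterance, which slows down inference and training and increases memory consumption.
Cheaper alternatives to self-attention have been developed for ASR, but they do not consistently reach the same level of accuracy.
This paper therefore proposes a new linear-time alternative to self-attention.
It summarises an utterance using the mean of the vectors over all time steps.
This single summary is then combined with time-specific information.
The method is called "SummaryMixing".
Introducing SummaryMixing into state-of-the-art ASR models makes it possible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and halving memory use.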

Abstract (original)

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method ‘SummaryMixing’. Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
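The core idea in the abstract (one mean over all time steps, then combined with each time step's own information) can be illustrated with a small sketch. This is not the authors' implementation: the function names `mean_summary` and `summary_mix` are hypothetical, and plain concatenation is assumed here as a stand-in for the combination step, which the abstract does not specify. The point is only that the summary is computed once in a single pass, so the cost grows linearly with the number of frames rather than quadratically.

```python
from typing import List

Vector = List[float]


def mean_summary(frames: List[Vector]) -> Vector:
    """Average the per-frame vectors into one utterance-level summary."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    totals = [0.0] * dim
    for frame in frames:  # single pass over T frames: O(T * dim)
        for i, value in enumerate(frame):
            totals[i] += value
    return [total / len(frames) for total in totals]


def summary_mix(frames: List[Vector]) -> List[Vector]:
    """Pair every frame with the shared summary.

    Assumption: concatenation stands in for the combination step,
    which the abstract does not describe in detail.
    """
    summary = mean_summary(frames)  # computed once, reused for every frame
    return [frame + summary for frame in frames]  # list concatenation


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed = summary_mix(frames)
print(mixed[0])  # [1.0, 2.0, 3.0, 4.0]
```

Each output vector keeps its time-specific part (the original frame) and gains the same global part (the mean), without any pairwise comparison between frames.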

arXiv information

Authors	Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
Published	2024-07-11 09:20:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
