SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

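Abstract

Modern speech processing systems rely on self-attention.
Unfortunately, token mixing with self-attention takes time quadratic in the length of the speech utterance, which slows down both training and inference and increases memory consumption.
Cheaper alternatives to self-attention have been developed for ASR, but they do not consistently reach the same level of accuracy.
This paper therefore proposes a novel linear-time alternative to self-attention.
It summarizes an utterance with the mean over the vectors for all time steps.
This single summary is then combined with time-specific information.
The method is called "SummaryMixing".
Introducing SummaryMixing into state-of-the-art ASR models makes it possible to preserve or exceed previous speech recognition performance while lowering training and inference times by up to 28% and halving the memory budget.
The benefits of SummaryMixing also generalize to other speech processing tasks, such as speech understanding.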

Abstract (original)

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method ‘SummaryMixing’. Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while lowering the training and inference times by up to 28% and reducing the memory budget by a factor of two. The benefits of SummaryMixing can also be generalized to other speech-processing tasks, such as speech understanding.
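The core idea stated in the abstract (average the vectors over all time steps once, then combine that single summary with each time step's own information) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_summary` and `summary_mixing`, and the `local_fn`, `summary_fn`, and `combine_fn` arguments, are hypothetical stand-ins for whatever learned transformations a real model would use, and plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
from typing import Callable, List

Vector = List[float]


def mean_summary(frames: List[Vector]) -> Vector:
    """Average the per-time-step vectors into a single summary vector.

    One pass over the T frames, so the cost grows linearly with T.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    total = [0.0] * dim
    for frame in frames:
        for i, value in enumerate(frame):
            total[i] += value
    return [t / len(frames) for t in total]


def summary_mixing(
    frames: List[Vector],
    local_fn: Callable[[Vector], Vector],
    summary_fn: Callable[[Vector], Vector],
    combine_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Combine each frame's time-specific information with one shared summary.

    local_fn, summary_fn and combine_fn are illustrative placeholders
    (assumptions of this sketch), not names taken from the paper.
    """
    # The summary is computed once and reused for every time step,
    # unlike self-attention, which compares every pair of time steps.
    summary = mean_summary([summary_fn(x) for x in frames])
    return [combine_fn(local_fn(x), summary) for x in frames]


def identity(x: Vector) -> Vector:
    return list(x)


def concat(local: Vector, summary: Vector) -> Vector:
    return local + summary


# Toy utterance with two time steps and two features per step.
frames = [[1.0, 2.0], [3.0, 4.0]]
mixed = summary_mixing(frames, identity, identity, concat)
print(mixed)  # [[1.0, 2.0, 2.0, 3.0], [3.0, 4.0, 2.0, 3.0]]
```

In this toy run every output vector carries its own frame followed by the same utterance-level mean `[2.0, 3.0]`. Because the mean is computed once and shared, the work scales with the number of time steps rather than with the number of pairs of time steps.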

arXiv information

Authors	Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
Published	2024-01-17 16:12:38+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
