How Redundant Is the Transformer Stack in Speech Representation Models?

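Summary

Self-supervised speech representation models, particularly those built on transformer architectures, have demonstrated remarkable performance across tasks such as speech recognition, speaker identification, and emotion detection.
Recent studies of transformer models have revealed high redundancy between layers and substantial potential for pruning; here we investigate both for transformer-based speech representation models.
We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment.
Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy among layers.
We demonstrate that transformer-based speech representation models can be pruned effectively without any post-training, removing up to 40% of the transformer layers while retaining over 95% of the model's predictive capacity.
Furthermore, we use knowledge distillation to replace the entire transformer stack with mimicking layers, reducing network size by 95-98% and inference time by up to 94%.
This substantial reduction in computational load comes without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.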
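
As a rough illustration of the layer-similarity analysis mentioned above, the following sketch computes centered kernel alignment between the activations of two layers. It assumes the linear-kernel variant of CKA and plain Python lists of per-frame feature vectors; the summary does not say which kernel or implementation the authors used, and the function names here are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each feature's mean over samples (rows are samples)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows.

    Assumption: linear kernel. Raises ZeroDivisionError if either input has
    zero variance in every feature.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(
        cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc)
    )
    return numerator / denominator


# Toy activations: 4 frames, 2 features per layer (made-up numbers).
layer_a = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.5]]
layer_b = [[-b, a] for a, b in layer_a]  # a 90-degree rotation of layer_a
print(linear_cka(layer_a, layer_b))
```

Linear CKA is unchanged by rotating or uniformly rescaling the features, which is why it is a common choice for comparing layers whose individual dimensions are not aligned. Evaluating it for every pair of layers gives the kind of similarity matrix in which a block-like structure would be visible.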

Summary (original)

Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning, which we will investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the model’s predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95-98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
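
The abstract does not state how layers are chosen for removal. One similarity-based heuristic, shown here purely as an assumption and not as the authors' procedure, is to drop the contiguous block of layers whose input and output representations are most similar. The function name and the toy similarity matrix below are hypothetical.

```python
def select_block_to_drop(similarity, num_to_drop):
    """Pick the contiguous block of layers whose input and output are most similar.

    similarity[i][j] is the similarity between hidden states h_i and h_j,
    where h_0 is the input to the stack and h_k is the output of layer k.
    Dropping layers s+1..s+num_to_drop amounts to using h_s in place of
    h_{s+num_to_drop}, so the heuristic (an assumption, not the paper's
    stated method) picks the s with the highest similarity[s][s+num_to_drop].
    Returns the 1-based indices of the layers to drop.
    """
    num_states = len(similarity)
    if not 0 < num_to_drop < num_states:
        raise ValueError("num_to_drop must be between 1 and the number of layers")
    starts = range(num_states - num_to_drop)
    best_start = max(starts, key=lambda s: similarity[s][s + num_to_drop])
    return list(range(best_start + 1, best_start + num_to_drop + 1))


# Made-up similarity matrix for a 4-layer stack (states h_0..h_4) with two
# blocks of mutually similar states: {h_0, h_1} and {h_2, h_3, h_4}.
toy_similarity = [
    [1.00, 0.90, 0.30, 0.25, 0.20],
    [0.90, 1.00, 0.35, 0.30, 0.25],
    [0.30, 0.35, 1.00, 0.97, 0.95],
    [0.25, 0.30, 0.97, 1.00, 0.98],
    [0.20, 0.25, 0.95, 0.98, 1.00],
]
print(select_block_to_drop(toy_similarity, 2))  # layers 3 and 4
```

Whether such a block can actually be removed without post-training still has to be checked on the downstream task; the similarity score only nominates candidates.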

arXiv information

Authors	Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, Lars Kai Hansen
Published	2025-01-17 12:27:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
