Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

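Abstract

Self-supervised learning performs well on speech tasks, but speech enhancement research still leaves considerable room for progress.
Restricting the attention mechanism to the temporal dimension alone limits how effectively a model can focus on the speech features that matter.
To address these issues, we introduce HFSDA, a new speech enhancement framework that integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism to significantly improve speech clarity and quality in noisy environments.
By combining self-supervised learning embeddings with Short-Time Fourier Transform (STFT) spectrogram features, the model captures both high-level semantic information and detailed spectral data, which allows a more thorough analysis and refinement of the speech signal.
In the spectrogram input branch, we adopt Omni-dimensional Dynamic Convolution (ODConv) to strengthen the extraction and integration of important information across multiple dimensions.
We also refine the Conformer model by strengthening its feature extraction along the spectral dimension as well as the temporal one.
Extensive experiments on the VCTK-DEMAND dataset show that HFSDA is comparable to existing state-of-the-art models, confirming the effectiveness of our approach.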
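
The idea of attending along both the time axis and the frequency axis can be illustrated with a small sketch. This is not the HFSDA architecture or its modified Conformer: the function names are hypothetical, the attention has no learned projections, and averaging the two branches is an assumption made only to keep the example short.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention without learned weights.

    tokens: list of equal-length vectors; queries, keys and values are
    all the tokens themselves (an illustrative simplification).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def dual_axis_attention(spec):
    """spec: T x F matrix (time frames x frequency bins).

    Time branch: each frame is a token. Frequency branch: each bin is a
    token. The two outputs are averaged (hypothetical fusion rule).
    """
    time_out = self_attention(spec)                        # T x F
    freq_out = transpose(self_attention(transpose(spec)))  # back to T x F
    return [[(a + b) / 2 for a, b in zip(r1, r2)]
            for r1, r2 in zip(time_out, freq_out)]
```

Calling `dual_axis_attention` on a small list-of-lists spectrogram returns a matrix of the same shape, in which every entry mixes information from other frames and from other frequency bins.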

Abstract (original)

Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introduces a novel speech enhancement framework, HFSDA, which skillfully integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism to significantly enhance speech clarity and quality in noisy environments. By leveraging self-supervised learning embeddings in tandem with Short-Time Fourier Transform (STFT) spectrogram features, our model excels at capturing both high-level semantic information and detailed spectral data, enabling a more thorough analysis and refinement of speech signals. Furthermore, we employ the innovative Omni-dimensional Dynamic Convolution (ODConv) technology within the spectrogram input branch, enabling enhanced extraction and integration of crucial information across multiple dimensions. Additionally, we refine the Conformer model by enhancing its feature extraction capabilities not only in the temporal dimension but also across the spectral domain. Extensive experiments on the VCTK-DEMAND dataset show that HFSDA is comparable to existing state-of-the-art models, confirming the validity of our approach.
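
As a rough illustration of pairing STFT spectrogram features with frame-level embeddings, the sketch below computes a magnitude STFT in plain Python and concatenates it with placeholder embedding vectors. The abstract does not say how the two feature spaces are aligned or fused, so the nearest-neighbour time alignment, the concatenation, and all function names here are assumptions; in practice the embeddings would come from a pretrained self-supervised model, not from hand-written lists.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=8, hop=4):
    """Magnitude STFT with a periodic Hann window.

    Returns a list of frames, each with frame_len // 2 + 1 frequency bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def resample_frames(frames, target_len):
    """Nearest-neighbour resampling along time so frame counts match."""
    src = len(frames)
    return [frames[min(src - 1, i * src // target_len)]
            for i in range(target_len)]


def fuse(spec, embeddings):
    """Concatenate each spectrogram frame with its aligned embedding
    (hypothetical fusion rule, chosen for simplicity)."""
    aligned = resample_frames(embeddings, len(spec))
    return [s + e for s, e in zip(spec, aligned)]
```

A typical call would be `fuse(stft_magnitude(samples), embeddings)`, where `samples` is a list of floats and `embeddings` is a list of equal-length vectors at a possibly different frame rate.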

arXiv information

Authors	Tao Zheng, Liejun Wang, Yinfeng Yu
Published	2024-08-13 14:04:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
