RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation

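Summary

Audio-visual speech separation methods aim to integrate different modalities to produce high-quality separated speech, thereby improving the performance of downstream tasks such as speech recognition.
Most existing state-of-the-art (SOTA) models operate in the time domain.
However, their overly simplistic approach to modeling acoustic features often requires larger and more computationally intensive models to reach SOTA performance.
This paper presents the Recurrent Time-Frequency Separation Network (RTFS-Net), a novel time-frequency domain audio-visual speech separation method that applies its algorithms to the complex-valued time-frequency bins produced by the Short-Time Fourier Transform.
The time and frequency dimensions of the audio are modeled and captured independently, using a multi-layered RNN along each dimension.
In addition, the paper introduces a unique attention-based fusion technique for efficiently integrating audio and visual information, and a new mask separation approach that exploits the intrinsic spectral nature of the acoustic features for a clearer separation.
RTFS-Net outperforms the previous SOTA method while using only 10% of the parameters and 18% of the MACs.
It is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.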

Abstract (original)

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
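The idea of treating the time and frequency axes of an STFT as two separate sequence directions can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stft` and `toy_recurrence`, the frame length, the hop size, and the fixed scalar weights are all assumptions made for illustration. The sketch also scans magnitudes with a toy scalar recurrence, whereas the abstract describes multi-layered RNNs applied to complex time-frequency bins.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Hypothetical helper: return frames of complex bins, indexed [time][frequency]."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed frame (slow, but dependency-free).
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ]
        frames.append(bins)
    return frames


def toy_recurrence(sequence, w_in=0.5, w_rec=0.5):
    """Toy scalar tanh recurrence standing in for an RNN layer (untrained, assumed weights)."""
    state = 0.0
    outputs = []
    for x in sequence:
        state = math.tanh(w_in * x + w_rec * state)
        outputs.append(state)
    return outputs


# A 32-sample sine wave with exactly 2 cycles per 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft(signal)
magnitude = [[abs(b) for b in frame] for frame in spec]

# Frequency path: scan across the bins inside each time frame.
freq_path = [toy_recurrence(frame) for frame in magnitude]

# Time path: scan across the frames for each frequency bin.
n_bins = len(magnitude[0])
time_path = [toy_recurrence([frame[k] for frame in magnitude])
             for k in range(n_bins)]
```

Here `freq_path` has one output sequence per time frame and `time_path` has one per frequency bin, which is the sense in which the two dimensions are processed independently in this sketch.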

arXiv information

Authors	Samuel Pegg, Kai Li, Xiaolin Hu
Published	2024-01-18 15:06:38+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
