An efficient encoder-decoder architecture with top-down attention for speech separation

Summary

Title: An efficient encoder-decoder architecture with top-down attention for speech separation

Summary:

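– Deep neural networks have achieved strong results on speech separation, but obtaining high performance while keeping model complexity low remains a challenge in real-world applications.
– This work proposes TDANet, an efficient bio-inspired encoder-decoder architecture that mimics the brain's top-down attention and reduces model complexity without sacrificing performance.
– Top-down attention in TDANet is extracted by a global attention (GA) module and cascaded local attention (LA) layers.
– The GA module takes multi-scale acoustic features as input and extracts a global attention signal, which then modulates features at different scales through direct top-down connections.
– The LA layers take features from adjacent layers as input and extract a local attention signal, which modulates the lateral input in a top-down manner.
– On three benchmark datasets, TDANet consistently achieves separation performance competitive with previous state-of-the-art (SOTA) methods while being more efficient.
– Specifically, TDANet's multiply-accumulate operations (MACs) are only 5% of those of Sepformer, one of the previous SOTA models, and its CPU inference time is only 10% of Sepformer's.
– A large version of TDANet obtains SOTA results on the three datasets, with MACs still only 10% of Sepformer's and CPU inference time only 24% of Sepformer's.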
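
The flow of global and local top-down attention described in the summary can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`bottom_up`, `global_attention`, `local_attention`), the use of average pooling, nearest-neighbour upsampling, and sigmoid gating are all assumptions chosen for illustration, and the features are plain 1-D lists rather than learned multi-channel tensors. The input is assumed to be long enough to be halved at every scale.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def downsample(x):
    """Halve the temporal resolution by averaging adjacent frame pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x, length):
    """Nearest-neighbour upsampling of x to `length` frames."""
    return [x[i * len(x) // length] for i in range(length)]


def bottom_up(x, num_scales=3):
    """Build multi-scale features, finest first (assumed stand-in for the encoder path)."""
    scales = [list(x)]
    for _ in range(num_scales - 1):
        scales.append(downsample(scales[-1]))
    return scales


def global_attention(scales):
    """Fuse all scales at the coarsest resolution, then gate every scale with the result."""
    coarsest_len = len(scales[-1])
    fused = [0.0] * coarsest_len
    for feat in scales:
        while len(feat) > coarsest_len:
            feat = downsample(feat)
        fused = [a + b for a, b in zip(fused, feat)]
    modulated = []
    for feat in scales:
        # Direct top-down connection: the same global signal reaches each scale.
        gate = [sigmoid(g) for g in upsample(fused, len(feat))]
        modulated.append([f * g for f, g in zip(feat, gate)])
    return modulated


def local_attention(modulated):
    """Cascade from coarse to fine: the coarser output gates the adjacent finer (lateral) input."""
    out = modulated[-1]
    for lateral in reversed(modulated[:-1]):
        gate = [sigmoid(g) for g in upsample(out, len(lateral))]
        out = [l * g for l, g in zip(lateral, gate)]
    return out


x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 4.0]
scales = bottom_up(x)
out = local_attention(global_attention(scales))
```

Because every gate lies strictly between 0 and 1, the output in this sketch has the same length as the input and never exceeds it in magnitude; a real model would replace the fixed pooling and gating with learned layers.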

Abstract (original)

Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping a low model complexity remains challenging in real-world applications. In this paper, we provide a bio-inspired efficient encoder-decoder architecture by mimicking the brain’s top-down attention, called TDANet, with decreased model complexity without sacrificing performance. The top-down attention in TDANet is extracted by the global attention (GA) module and the cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract global attention signal, which then modulates features of different scales by direct top-down connections. The LA layers use features of adjacent layers as input to extract the local attention signal, which is used to modulate the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved competitive separation performance to previous state-of-the-art (SOTA) methods with higher efficiency. Specifically, TDANet’s multiply-accumulate operations (MACs) are only 5% of Sepformer, one of the previous SOTA models, and CPU inference time is only 10% of Sepformer. In addition, a large-size version of TDANet obtained SOTA results on three datasets, with MACs still only 10% of Sepformer and the CPU inference time only 24% of Sepformer.

arXiv information

Authors	Kai Li, Runxuan Yang, Xiaolin Hu
Published	2023-03-30 06:01:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
