Matching Latent Encoding for Audio-Text based Keyword Spotting

Abstract

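Jointly using audio and text embeddings for keyword spotting (KWS) has yielded high-quality results, but a key challenge remains largely unsolved: how to semantically align the two embeddings for multi-word keywords whose sequences differ in length.
This paper proposes an audio-text-based end-to-end model architecture for flexible KWS that builds on learned audio and text embeddings.
The architecture uses Dynamic Sequence Partitioning (DSP), a novel dynamic-programming-based algorithm that exploits the monotonic alignment of spoken content to optimally partition the audio sequence into the same length as the word-based text sequence.
The proposed model consists of an encoder block that produces audio and text embeddings, a projector block that projects the individual embeddings into a common latent space, and an audio-text aligner that contains the DSP algorithm and aligns the audio and text embeddings to determine whether the spoken content matches the text.
Experimental results show that DSP is more effective than other partitioning schemes, and that the proposed architecture outperforms the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal Error Rate (EER) by 14.4% and 28.9%, respectively.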
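
The partitioning idea in the abstract can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the authors' DSP: the abstract does not give the objective, so the sketch assumes a frame-by-word similarity matrix `sim` and maximises the summed similarity over monotonic partitions in which every word receives at least one frame. The function names `partition_frames` and `pool_segments`, and the mean pooling step, are hypothetical.

```python
from typing import List


def partition_frames(sim: List[List[float]]) -> List[int]:
    """Assign each of T audio frames to one of K words, monotonically.

    sim[t][k] is an ASSUMED similarity between audio frame t and word k.
    Returns T non-decreasing word indices, each word used at least once,
    chosen to maximise the summed similarity.
    """
    T, K = len(sim), len(sim[0])
    if T < K:
        raise ValueError("need at least one frame per word")
    NEG = float("-inf")
    # best[t][k]: best score for frames 0..t when frame t belongs to word k.
    best = [[NEG] * K for _ in range(T)]
    # stay[t][k]: True if frame t-1 belongs to the same word k on the best path.
    stay = [[True] * K for _ in range(T)]
    best[0][0] = sim[0][0]
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            keep = best[t - 1][k]  # continue the current segment
            advance = best[t - 1][k - 1] if k > 0 else NEG  # open a new one
            stay[t][k] = keep >= advance
            best[t][k] = sim[t][k] + max(keep, advance)
    # Backtrack from the last frame, which must belong to the last word.
    assign = [0] * T
    k = K - 1
    for t in range(T - 1, -1, -1):
        assign[t] = k
        if t > 0 and not stay[t][k]:
            k -= 1
    return assign


def pool_segments(frames: List[List[float]], assign: List[int],
                  num_words: int) -> List[List[float]]:
    """Mean-pool frame embeddings per word (an assumed aggregation)."""
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_words)]
    counts = [0] * num_words
    for vec, k in zip(frames, assign):
        counts[k] += 1
        for d, v in enumerate(vec):
            sums[k][d] += v
    return [[s / counts[k] for s in sums[k]] for k in range(num_words)]
```

For example, with five frames and two words, `partition_frames([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.2, 0.8]])` returns `[0, 0, 1, 1, 1]`, and pooling then yields one audio vector per word, so the audio sequence has the same length as the text sequence.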

Abstract (original)

Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS), which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence using the monotonic alignment of spoken content. Our proposed model consists of an encoder block to get audio and text embeddings, a projector block to project individual embeddings to a common latent space, and an audio-text aligner containing a novel DSP algorithm, which aligns the audio and text embeddings to determine if the spoken content is the same as the text. Experimental results show that our DSP is more effective than other partitioning schemes, and the proposed architecture outperformed the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal-Error-Rate (EER) by 14.4% and 28.9%, respectively.
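
The two reported metrics can be computed from detection scores as follows. This is a minimal, dependency-free sketch for illustration only; the function names and the toy scores are hypothetical and do not come from the paper, and the EER here is approximated at the observed threshold where false-accept and false-reject rates are closest.

```python
from typing import List, Tuple


def _split(scores: List[float], labels: List[int]) -> Tuple[List[float], List[float]]:
    """Separate scores of matching (label 1) and non-matching (label 0) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos, neg = _split(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(scores: List[float], labels: List[int]) -> float:
    """Approximate EER by sweeping every observed score as a threshold."""
    pos, neg = _split(scores, labels)
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        far = sum(n >= thr for n in neg) / len(neg)  # false-accept rate
        frr = sum(p < thr for p in pos) / len(pos)   # false-reject rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A higher AUC and a lower EER both indicate better separation between matching and non-matching audio-text pairs.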

arXiv information

Authors	Kumari Nishu, Minsik Cho, Devang Naik
Published	2023-06-08 14:44:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
