Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Abstract

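Humans have an extraordinary ability to focus selectively on a sound source of interest in complex acoustic environments, commonly called cocktail party scenarios.
Target speaker extraction (TSE) models were developed in an attempt to reproduce this auditory attention capability in machines.
These models use pre-registered cues of the target speaker to extract the sound source of interest.
In real-world scenarios, however, their effectiveness is limited because the pre-registered cues may vary or may be absent altogether.
To address this limitation, this study investigates integrating natural language to make existing TSE models more flexible and controllable.
Specifically, it proposes a model named LLM-TSE, in which a large language model (LLM) extracts useful semantic cues from the user's typed text input. These cues can either complement the pre-registered cues or work independently to control the TSE process.
The experimental results show competitive performance when only text-based cues are provided, and set a new state of the art when text-based cues are combined with pre-registered acoustic cues.
To the best of the authors' knowledge, this is the first work to successfully incorporate text-based cues to guide target speaker extraction, and it may become a cornerstone for research on the cocktail party problem.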

Abstract (original)

Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this remarkable auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage the pre-registered cues of the target speaker to extract the sound source of interest. However, the effectiveness of these models is hindered in real-world scenarios due to the potential variation or even absence of pre-registered cues. To address this limitation, this study investigates the integration of natural language to enhance the flexibility and controllability of existing TSE models. Specifically, we propose a model named LLM-TSE, in which a large language model (LLM) extracts useful semantic cues from the user’s typed text input; these cues can complement the pre-registered cues or work independently to control the TSE process. Our experimental results demonstrate competitive performance when only text-based cues are presented, and a new state of the art is set when they are combined with pre-registered acoustic cues. To the best of our knowledge, this is the first work that has successfully incorporated text-based cues to guide target speaker extraction, which can be a cornerstone for cocktail party problem research.

arXiv information

Authors	Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan
Published	2023-10-11 08:17:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
