Unsupervised Speech Segmentation: A General Approach Using Speech Language Models

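Abstract

This paper introduces an unsupervised approach to speech segmentation. It builds on previously studied approaches such as speaker diarization while remaining applicable to an inclusive set of acoustic-semantic distinctions, paving the way toward a general unsupervised speech segmentation approach.
Unlike traditional speech and audio segmentation, which focuses mainly on spectral changes in the input signal (for example, phone segmentation), our approach tries to split a spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, such as emotion or speaker.
Most speech segmentation tasks handle only a single kind of style change, as in emotion diarization, whereas our approach tries to handle multiple acoustic-semantic style changes.
Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method for segmenting a given speech utterance.
We empirically demonstrate the effectiveness of the proposed approach by considering several setups.
The results suggest that the proposed method outperforms the evaluated baselines on boundary detection, segment purity, and over-segmentation.
Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.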
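
The abstract names boundary detection and over-segmentation as evaluation criteria but does not define them. As an illustration only, the sketch below scores predicted segment boundaries against reference boundaries using definitions that are common in the segmentation literature: a prediction counts as a hit if it lies within a tolerance of an unmatched reference boundary, and over-segmentation is the ratio of predicted to reference boundaries minus one. The function name `boundary_scores`, the tolerance value, and the matching rule are assumptions and may differ from what the paper's released code does.

```python
from typing import Dict, Sequence


def boundary_scores(
    reference: Sequence[float],
    predicted: Sequence[float],
    tolerance: float = 0.5,
) -> Dict[str, float]:
    """Score predicted boundaries (in seconds) against reference boundaries.

    Hypothetical helper: one-to-one matching within `tolerance` seconds.
    These definitions are an assumption, not taken from the paper.
    """
    ref = sorted(reference)
    pred = sorted(predicted)

    # Two-pointer sweep: each reference boundary matches at most one prediction.
    hits = 0
    i = j = 0
    while i < len(ref) and j < len(pred):
        if abs(ref[i] - pred[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif pred[j] < ref[i]:
            j += 1  # prediction is too early to match this or any later reference
        else:
            i += 1  # reference boundary was missed

    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Positive values mean more boundaries were predicted than exist.
    over_seg = len(pred) / len(ref) - 1.0 if ref else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "over_segmentation": over_seg,
    }


if __name__ == "__main__":
    scores = boundary_scores([1.0, 2.0, 3.0], [1.1, 2.6, 3.0, 3.9], tolerance=0.2)
    for name, value in scores.items():
        print(f"{name}: {value:.3f}")
```

A positive over-segmentation value indicates that the method proposes more boundaries than the reference contains, which can inflate recall at the cost of precision; reporting both together guards against that.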

Abstract (original)

In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.

arXiv information

Authors	Avishai Elmakies, Omri Abend, Yossi Adi
Published	2025-01-07 11:32:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
