Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling

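Abstract

A lack of labeled data is a common challenge in speech classification tasks, particularly those that require extensive subjective assessment, such as cognitive state classification.
In this work, we propose a semi-supervised learning (SSL) framework and introduce a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model.
Acoustically, unlabeled data are compared to labeled data using the Fréchet audio distance, calculated from embeddings generated by multiple audio encoders.
Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and to predict labels based on the proposed task-specific knowledge.
Data are treated as high-confidence when the pseudo-labels from both sources agree, and as low-confidence when they do not.
A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met.
We evaluate the SSL framework on emotion recognition and dementia detection tasks.
Experimental results show that, using only 30% of the labeled data, the method achieves performance competitive with fully supervised learning and significantly outperforms two selected baselines.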
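
The selection step and the iterative labeling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `train_and_predict` callable, the confidence threshold, and the stopping rule (no newly accepted items, or a maximum number of rounds) are all assumptions made for the example, since the abstract only states that a "predefined criterion" is used.

```python
def split_by_agreement(acoustic_labels, linguistic_labels):
    """Split utterances into high- and low-confidence sets.

    Both arguments map utterance id -> pseudo-label. An utterance is
    high-confidence only when the two views assign the same label.
    """
    high, low = {}, []
    for utt_id, a_label in acoustic_labels.items():
        if linguistic_labels.get(utt_id) == a_label:
            high[utt_id] = a_label
        else:
            low.append(utt_id)
    return high, low


def self_train(high, low, train_and_predict, threshold=0.9, max_rounds=5):
    """Iteratively label low-confidence items (hypothetical stopping rule).

    train_and_predict(train_set, ids) is an assumed callable that trains a
    classifier on train_set and returns {id: (label, confidence)} for ids.
    """
    train_set = dict(high)
    remaining = list(low)
    for _ in range(max_rounds):
        if not remaining:
            break
        preds = train_and_predict(train_set, remaining)
        accepted = {i: lab for i, (lab, conf) in preds.items() if conf >= threshold}
        if not accepted:
            break  # nothing confident enough; stop early
        train_set.update(accepted)
        remaining = [i for i in remaining if i not in accepted]
    return train_set, remaining
```

In this sketch the classifier is passed in as a callable so that the agreement logic stays independent of any particular model or feature set.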

Abstract (original)

The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
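
For readers unfamiliar with the Fréchet audio distance mentioned above, the underlying quantity is the Fréchet distance between two Gaussians fitted to sets of embeddings: the squared distance between the means plus Tr(Σa + Σb − 2(ΣaΣb)^(1/2)). The sketch below computes it with NumPy and then assigns an acoustic pseudo-label by picking the class whose labeled embeddings are closest. The nearest-class rule, the function names, and the input layout (rows are embedding vectors) are assumptions for illustration; the abstract does not specify how the distances are turned into labels or how multiple encoders are combined.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim), with n_samples > 1.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (up to numerical noise, hence real part + clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def acoustic_pseudo_label(item_embeddings, labeled_by_class):
    """Assumed rule: pick the class with the smallest Fréchet distance.

    labeled_by_class maps class label -> embeddings of labeled data.
    """
    distances = {
        label: frechet_distance(item_embeddings, ref)
        for label, ref in labeled_by_class.items()
    }
    return min(distances, key=distances.get)
```

Using the eigenvalue form avoids an explicit matrix square root; a library routine for the matrix square root would be an equally reasonable choice.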

arXiv information

Authors	Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai
Published	2025-04-30 13:24:00+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
