Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

Summary

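Speech production is a complex sequential process that involves the coordination of various articulatory features.
Among the articulators, the tongue is a highly versatile active articulator that shapes airflow to produce targeted speech sounds that are intelligible, clear, and distinct.
This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics, using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing, with fixed weight initialization.
The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, each introducing variation in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment.
Model performance is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes.
Experimental results indicate that the proposed model with the fixed-weight approach outperformed adaptive weight initialization within a relatively small number of training epochs.
These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advances in speech production research and applications.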

Summary (original)

Speech production is a complex sequential process which involve the coordination of various articulatory features. Among them tongue being a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intellectual, clear, and distinct. This paper presents a novel approach for predicting tongue and lip articulatory features involved in a given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed weights initialization. The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) datasets, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), corpus dependent (CD) and cross corpus (CC) modes. Experimental results indicate that the proposed model with fixed weights approach outperformed the adaptive weights initialization with in relatively minimal number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.

arXiv information

Authors	Leena G Pillai, D. Muhammad Noorul Mubarak, Elizabeth Sherly
Published	2025-04-25 05:57:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
