Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

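Summary

This paper presents a demonstration of a real-time, continuous turn-taking prediction system.
The system is based on a voice activity projection (VAP) model, which maps stereo dialogue audio directly to future voice activity.
The VAP model consists of contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer.
The authors examine the effect of input context audio length and show that the proposed system can run in real time on a CPU with minimal performance degradation.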

Abstract (original)

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.
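The abstract mentions limiting the input context audio length so that prediction can keep up in real time. The following is a minimal sketch of that idea only, not the authors' implementation: a fixed-length buffer keeps the most recent stereo samples, and a predictor is called on that window after each incoming chunk. The names `ContextBuffer`, `run_stream`, and `energy_placeholder`, as well as the sample rate and chunk sizes, are assumptions made for illustration; the placeholder predictor merely averages amplitude per channel and stands in for a real VAP model.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

Frame = Tuple[float, float]  # one stereo sample: (speaker A, speaker B)


class ContextBuffer:
    """Keeps only the most recent `context_sec` seconds of stereo audio."""

    def __init__(self, sample_rate: int, context_sec: float) -> None:
        self.max_samples = int(sample_rate * context_sec)
        # A deque with maxlen drops the oldest samples automatically.
        self.samples: Deque[Frame] = deque(maxlen=self.max_samples)

    def push(self, chunk: List[Frame]) -> None:
        self.samples.extend(chunk)

    def window(self) -> List[Frame]:
        return list(self.samples)


def energy_placeholder(window: List[Frame]) -> Tuple[float, float]:
    """Hypothetical stand-in for a model: mean absolute amplitude per channel."""
    if not window:
        return (0.0, 0.0)
    n = len(window)
    mean_a = sum(abs(a) for a, _ in window) / n
    mean_b = sum(abs(b) for _, b in window) / n
    return (mean_a, mean_b)


def run_stream(
    chunks: Iterable[List[Frame]],
    buffer: ContextBuffer,
    predict: Callable[[List[Frame]], Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Feeds chunks one at a time and predicts on the truncated context."""
    outputs = []
    for chunk in chunks:
        buffer.push(chunk)
        outputs.append(predict(buffer.window()))
    return outputs


if __name__ == "__main__":
    # Toy values: 4 samples per second and a 1-second context window.
    buf = ContextBuffer(sample_rate=4, context_sec=1.0)
    stream = [[(1.0, 0.0)] * 2, [(0.0, 1.0)] * 2, [(0.0, 1.0)] * 2]
    for step, out in enumerate(run_stream(stream, buf, energy_placeholder)):
        print(step, out)
```

A shorter `context_sec` means less audio to process per step, which is the trade-off between speed and accuracy that the abstract refers to; the actual context lengths studied are not given in this summary.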

arXiv information

Authors	Koji Inoue, Bing’er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze
Published	2024-01-10 01:09:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
