Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases

Summary

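Speech-to-text translation (S2TT) is usually handled by cascade systems, in which a speech recognition system produces a transcript that is then passed to a translation model.
Interest in direct speech translation systems is growing because they avoid error propagation and the loss of non-verbal content, yet prior work on direct S2TT has struggled to conclusively establish the benefit of integrating the acoustic signal directly into the translation process.
This work proposes contrastive evaluation as a way to quantitatively measure how well direct S2TT systems disambiguate utterances in which prosody plays a crucial role.
Specifically, we evaluated Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are needed to produce a translation with the correct intent, such as a statement, a yes/no question, or a wh-question.
Our results clearly show the value of direct translation systems over cascade translation models: overall accuracy on ambiguous cases improves by 12.9%, and the F1 score for one of the major intent categories improves by up to 15.6%.
To the best of our knowledge, this is the first work to provide quantitative evidence that direct S2TT models can effectively exploit prosody.
The evaluation code is openly accessible and freely available for review and use.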

Abstract (original)

Speech-to-Text Translation (S2TT) has typically been addressed with cascade systems, where speech recognition systems generate a transcription that is subsequently passed to a translation model. While there has been a growing interest in developing direct speech translation systems to avoid propagating errors and losing non-verbal content, prior work in direct S2TT has struggled to conclusively establish the advantages of integrating the acoustic signal directly into the translation process. This work proposes using contrastive evaluation to quantitatively measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Specifically, we evaluated Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are necessary to produce translations with the correct intent, whether it’s a statement, a yes/no question, a wh-question, and more. Our results clearly demonstrate the value of direct translation systems over cascade translation models, with a notable 12.9% improvement in overall accuracy in ambiguous cases, along with up to a 15.6% increase in F1 scores for one of the major intent categories. To the best of our knowledge, this work stands as the first to provide quantitative evidence that direct S2TT models can effectively leverage prosody. The code for our evaluation is openly accessible and freely available for review and utilisation.
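
The contrastive evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' released code, and the abstract does not say how candidates are scored. The sketch assumes that each utterance comes with one candidate translation per intent, that the system assigns every candidate a score (for example a log-likelihood), and that the highest-scoring candidate counts as the system's choice. The intent labels (`statement`, `yes_no`, `wh`), the function names, and all scores below are hypothetical.

```python
from collections import defaultdict


def contrastive_choice(scores):
    """Return the intent whose candidate translation received the highest score."""
    return max(scores, key=scores.get)


def evaluate(examples):
    """Compute overall accuracy and per-intent F1.

    examples: list of (gold_intent, {intent: score}) pairs, where each score
    is assumed to be the model's score for that intent's candidate translation.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    correct = 0
    for gold, scores in examples:
        pred = contrastive_choice(scores)
        if pred == gold:
            correct += 1
            tp[gold] += 1
        else:
            fp[pred] += 1  # the wrongly chosen intent gets a false positive
            fn[gold] += 1  # the missed gold intent gets a false negative
    accuracy = correct / len(examples)
    f1 = {}
    for label in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0
    return accuracy, f1


# Hypothetical scores for four utterances (illustrative values only).
examples = [
    ("statement", {"statement": -1.2, "yes_no": -2.0, "wh": -3.1}),
    ("yes_no", {"statement": -1.5, "yes_no": -1.1, "wh": -2.4}),
    ("wh", {"statement": -2.2, "yes_no": -1.0, "wh": -1.6}),
    ("wh", {"statement": -3.0, "yes_no": -2.5, "wh": -0.9}),
]
accuracy, f1 = evaluate(examples)
print(accuracy, f1)
```

Under this setup, a cascade system that sees only the transcript has no basis for preferring one intent over another when the text is ambiguous, whereas a direct system can in principle use prosody to separate the candidates. Accuracy and per-intent F1 are the metrics the abstract reports for that comparison.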

arXiv information

Authors	Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow
Published	2024-02-01 14:46:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
