End-to-End Simultaneous Speech Translation with Differentiable Segmentation

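Summary

End-to-end simultaneous speech translation (SimulST), also known as streaming speech translation, outputs a translation while the streaming speech input is still arriving, so it must segment the speech input and translate from the speech received so far.
However, segmenting the speech input at an unfavorable moment can break its acoustic integrity and degrade the translation model's performance.
The key to SimulST is therefore learning to segment the speech input at moments that help the translation model produce high-quality translations.
Existing SimulST methods, whether they use fixed-length segmentation or an external segmentation model, always separate segmentation from the underlying translation model. This gap yields segmentations that do not necessarily benefit translation.
This paper proposes Differentiable Segmentation (DiSeg) for SimulST, which learns segmentation directly from the underlying translation model.
DiSeg makes hard segmentation differentiable through the proposed expectation training, so that it can be trained jointly with the translation model and learn segmentation that benefits translation.
Experimental results show that DiSeg achieves state-of-the-art performance and exhibits superior segmentation capability.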

Summary (original)

End-to-end simultaneous speech translation (SimulST) outputs translation while receiving the streaming speech inputs (a.k.a. streaming speech translation), and hence needs to segment the speech inputs and then translate based on the current received speech. However, segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model. Therefore, learning to segment the speech inputs at those moments that are beneficial for the translation model to produce high-quality translation is the key to SimulST. Existing SimulST methods, either using the fixed-length segmentation or external segmentation model, always separate segmentation from the underlying translation model, where the gap results in segmentation outcomes that are not necessarily beneficial for the translation process. In this paper, we propose Differentiable Segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model. DiSeg turns hard segmentation into differentiable through the proposed expectation training, enabling it to be jointly trained with the translation model and thereby learn translation-beneficial segmentation. Experimental results demonstrate that DiSeg achieves state-of-the-art performance and exhibits superior segmentation capability.
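The abstract does not spell out how expectation training works, so the following is only a generic illustration of why a hard segmentation decision blocks gradients and how an expectation can stand in for it. It is not DiSeg's actual formulation. The function names, the 0.5 threshold, and the per-frame boundary logits are all assumptions made for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Map a boundary logit to a boundary probability."""
    return 1.0 / (1.0 + math.exp(-z))


def hard_segment_ids(probs, threshold=0.5):
    """Hard segmentation (assumed rule): a frame whose boundary probability
    reaches the threshold closes the current segment. The thresholding step
    is piecewise constant, so it passes no useful gradient to the logits."""
    ids, current = [], 0
    for p in probs:
        ids.append(current)
        if p >= threshold:
            current += 1
    return ids


def expected_segment_ids(probs):
    """Soft counterpart: if each frame independently closes a segment with
    probability p, the expected segment index of a frame is the sum of the
    boundary probabilities of all earlier frames. This is a smooth function
    of the logits, so it can be trained with gradient descent."""
    ids, running = [], 0.0
    for p in probs:
        ids.append(running)
        running += p
    return ids


if __name__ == "__main__":
    # Hypothetical boundary logits for five speech frames.
    logits = [-4.0, -4.0, 4.0, -4.0, 4.0]
    probs = [sigmoid(z) for z in logits]
    print(hard_segment_ids(probs))      # [0, 0, 0, 1, 1]
    print(expected_segment_ids(probs))  # close to the hard indices
    print(sum(probs))                   # expected number of boundaries
```

When the boundary probabilities are close to 0 or 1, the expected indices stay close to the hard ones, which is what lets a relaxation of this kind be used during training while hard decisions are used at inference time.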

arXiv information

Authors	Shaolei Zhang, Yang Feng
Published	2023-05-25 14:25:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
