Representation Purification for End-to-End Speech Translation

Summary

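Speech-to-text translation (ST) is a cross-modal task that converts spoken language into text in another language.
Previous work has focused mainly on strengthening speech translation by facilitating knowledge transfer from machine translation, exploring various ways to bridge the gap between the speech and text modalities.
Despite substantial progress, factors in speech that are unrelated to the translation content, such as timbre and rhythm, often limit the efficiency of this knowledge transfer.
In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors.
Through preliminary experiments, we examine how content-agnostic factors affect translation performance and observe a significant performance drop when content-agnostic perturbations are introduced into the speech signal.
To address this issue, we propose the \textbf{S}peech \textbf{R}epresentation \textbf{P}urification with \textbf{S}upervision \textbf{E}nhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST.
Experiments on the MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves excellent performance under the \textit{transcript-free} setting.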
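
The abstract does not say which content-agnostic perturbations were used in the preliminary experiments. As a minimal sketch only, assuming that speed and loudness changes are representative examples (they alter rhythm, pitch, and volume but leave the spoken words unchanged), the snippet below perturbs a waveform in plain Python. The function names `make_tone`, `speed_perturb`, and `gain_perturb` are hypothetical, and a sine tone stands in for real speech.

```python
import math


def make_tone(freq_hz, duration_s, sample_rate=16000):
    """Synthesize a sine tone as a stand-in for a speech waveform."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def speed_perturb(samples, factor):
    """Resample by linear interpolation so playback is `factor` times faster.

    factor > 1 shortens the signal, factor < 1 lengthens it. On real speech
    this changes rhythm and pitch but not the word sequence.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    out_len = max(1, int(len(samples) / factor))
    out = []
    for j in range(out_len):
        pos = j * factor                 # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def gain_perturb(samples, gain):
    """Scale the amplitude; loudness changes, content does not."""
    return [gain * s for s in samples]


# Illustrative usage: one clean signal and two perturbed variants of it.
wave = make_tone(440.0, 0.5)
faster = speed_perturb(wave, 1.25)
quieter = gain_perturb(wave, 0.5)
print(len(wave), len(faster), len(quieter))
```

Feeding both the clean and the perturbed waveform to the same ST model and comparing translation quality is one way such a sensitivity check could be set up; that evaluation step is not shown here.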

Summary (original)

Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. Previous research primarily focused on enhancing speech translation by facilitating knowledge transfer from machine translation, exploring various methods to bridge the gap between speech and text modalities. Despite substantial progress made, factors in speech that are not relevant to translation content, such as timbre and rhythm, often limit the efficiency of knowledge transfer. In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors. We examine the impact of content-agnostic factors on translation performance through preliminary experiments and observe a significant performance deterioration when content-agnostic perturbations are introduced to speech signals. To address this issue, we propose a \textbf{S}peech \textbf{R}epresentation \textbf{P}urification with \textbf{S}upervision \textbf{E}nhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves preeminent performance under a \textit{transcript-free} setting.

arXiv information

Authors	Chengwei Zhang, Yue Zhou, Rui Zhao, Yidong Chen, Xiaodong Shi
Published	2024-12-05 15:50:44+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
