Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference

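Summary

A common approach to streaming speech translation is to employ a single offline model with a \textit{wait-$k$} policy to support different latency requirements. This is simpler than training multiple online models under different latency constraints.
However, using a model trained on complete utterances for streaming inference on partial input creates a mismatch.
We show that the speech representations extracted at the end of a streaming input differ significantly from those extracted from the complete utterance.
To address this issue, we propose a new approach called Future-Aware Streaming Translation (FAST), which adapts an offline ST model to streaming input.
FAST comprises a Future-Aware Inference (FAI) strategy, which incorporates future context through a trainable masked embedding, and a Future-Aware Distillation (FAD) framework, which transfers future context from an approximation of the full speech to the streaming input.
Experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks show that FAST achieves better trade-offs between translation quality and latency than strong baselines.
Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch between offline training and online inference.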

Abstract (original)

A popular approach to streaming speech translation is to employ a single offline model with a \textit{wait-$k$} policy to support different latency requirements, which is simpler than training multiple online models with different latency constraints. However, there is a mismatch problem in using a model trained with complete utterances for streaming inference with partial input. We demonstrate that speech representations extracted at the end of a streaming input are significantly different from those extracted from a complete utterance. To address this issue, we propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input. FAST includes a Future-Aware Inference (FAI) strategy that incorporates future context through a trainable masked embedding, and a Future-Aware Distillation (FAD) framework that transfers future context from an approximation of full speech to streaming input. Our experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks show that FAST achieves better trade-offs between translation quality and latency than strong baselines. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch problem between offline training and online inference.
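For readers unfamiliar with the \textit{wait-$k$} policy mentioned above, the following is a minimal sketch of its read/write schedule, not the authors' implementation. It assumes the usual formulation in which the decoder first reads k source units and then alternates between writing one target token and reading one more unit until the source is exhausted. The function name `wait_k_schedule`, the notion of fixed-size speech "chunks", and all parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def wait_k_schedule(
    num_source_chunks: int, num_target_tokens: int, k: int
) -> List[Tuple[str, int]]:
    """Return the READ/WRITE action sequence of a wait-k policy.

    Assumption (illustrative only): the speech input arrives as
    `num_source_chunks` fixed-size chunks, and before emitting target
    token t (0-indexed) the model must have read min(k + t, total) chunks.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    actions: List[Tuple[str, int]] = []
    chunks_read = 0
    for t in range(num_target_tokens):
        # Number of source chunks that must be visible before writing token t.
        needed = min(k + t, num_source_chunks)
        while chunks_read < needed:
            actions.append(("READ", chunks_read))
            chunks_read += 1
        actions.append(("WRITE", t))
    return actions


if __name__ == "__main__":
    for action in wait_k_schedule(num_source_chunks=4, num_target_tokens=4, k=2):
        print(action)
```

With a larger k the model waits for more speech before it starts translating, which raises latency but narrows the gap to the full-utterance input the offline model saw in training. That gap is the mismatch the abstract refers to.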

arXiv information

Authors	Biao Fu,Kai Fan,Minpeng Liao,Zhongqiang Huang,Boxing Chen,Yidong Chen,Xiaodong Shi
Published	2023-03-14 13:56:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
