Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

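Abstract

Simultaneous speech translation (SimulST) generates a translation incrementally while processing partial speech input.
Large language models (LLMs) have shown strong capabilities in offline translation tasks, but applying them to SimulST poses notable challenges.
Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding by a bidirectional speech encoder, or rely on a fixed read/write policy, which limits efficiency and performance.
This work introduces Efficient and Adaptive Simultaneous Speech Translation (EASiST), which has a fully unidirectional architecture covering both the speech encoder and the LLM.
EASiST includes a multi-latency data curation strategy that generates semantically aligned SimulST training samples, and it redefines SimulST as an interleaved generation task with explicit read/write tokens.
To facilitate adaptive inference, it incorporates a lightweight policy head that dynamically predicts read/write actions.
In addition, a multi-stage training strategy is used to align the speech and text modalities and to optimize both translation and policy behavior.
Experiments on the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets show that EASiST offers superior latency-quality trade-offs compared with several strong baselines.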

Abstract (original)

Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have showcased strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding of bidirectional speech encoder, or they depend on a fixed read/write policy, limiting the efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with fully unidirectional architecture, including both speech encoder and LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
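
The abstract describes SimulST as an interleaved generation task in which a policy decides, step by step, whether to read more speech or write a target token. The following is a minimal sketch of that control flow only, not the authors' implementation. Everything in it is an assumption made for illustration: the token names `<read>` and `<write>`, the function `run_interleaved`, the 0.5 threshold, and the two toy callables, which stand in for a learned policy head and an LLM decoding step. Source "chunks" are plain strings rather than audio.

```python
from typing import Callable, List, Optional, Tuple

READ, WRITE = "<read>", "<write>"  # hypothetical action token names

# Both callables are hypothetical stand-ins: in a real system the policy
# would be a learned head and the translator an LLM decoding step.
Policy = Callable[[List[str], List[str]], float]
Translator = Callable[[List[str], List[str]], Optional[str]]


def run_interleaved(
    chunks: List[str],
    policy: Policy,
    translate_next: Translator,
    threshold: float = 0.5,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Interleave READ and WRITE actions over a stream of source chunks."""
    heard: List[str] = []
    written: List[str] = []
    trace: List[Tuple[str, str]] = []
    pos = 0
    while True:
        source_done = pos == len(chunks)
        # Once the source is exhausted, only WRITE actions remain possible.
        if source_done or policy(heard, written) >= threshold:
            token = translate_next(heard, written)
            if token is not None:
                written.append(token)
                trace.append((WRITE, token))
                continue
            if source_done:
                break  # nothing left to read or to emit
        # READ: consume one more source chunk.
        heard.append(chunks[pos])
        trace.append((READ, chunks[pos]))
        pos += 1
    return written, trace


def toy_policy(heard: List[str], written: List[str]) -> float:
    # Toy rule: prefer WRITE once the source is two chunks ahead of the output.
    return 1.0 if len(heard) - len(written) >= 2 else 0.0


def toy_translate_next(heard: List[str], written: List[str]) -> Optional[str]:
    # Toy "translation": uppercase the next untranslated chunk, if any.
    return heard[len(written)].upper() if len(written) < len(heard) else None


written, trace = run_interleaved(["a", "b", "c"], toy_policy, toy_translate_next)
print(written)
print([action for action, _ in trace])
```

With these toy stand-ins the action sequence is read, read, write, read, write, write. Swapping `toy_policy` for a function that returns a model-predicted probability would make the read/write schedule depend on the input rather than on a fixed lag, which is the distinction the abstract draws between adaptive and fixed policies.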

arXiv information

Authors	Biao Fu, Donglei Yu, Minpeng Liao, Chengxi Li, Yidong Chen, Kai Fan, Xiaodong Shi
Published	2025-04-16 06:46:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
