Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

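Summary

This paper presents Cross Language Agent — Simultaneous Interpretation (CLASI), a high-quality, human-like simultaneous speech translation (SiST) system.
Inspired by professional human interpreters, it uses a novel data-driven read-write strategy to balance translation quality against latency.
To handle in-domain terminology, CLASI employs a multi-modal retrieval module that fetches relevant information to augment the translation.
Backed by LLMs, the approach produces error-tolerant translations by taking into account the input audio, the historical context, and the retrieved information.
Experimental results show that the system outperforms other systems by significant margins.
In line with professional human interpreters, the authors evaluate CLASI with a human evaluation metric, valid information proportion (VIP), which measures how much information is successfully conveyed to listeners.
In real-world scenarios, where speech is often disfluent, informal, and unclear, CLASI achieves a VIP of 81.3% for Chinese-to-English and 78.0% for English-to-Chinese translation.
In contrast, state-of-the-art commercial or open-source systems achieve only 35.4% and 41.6%.
On an extremely hard dataset, where other systems achieve under 13% VIP, CLASI still achieves 70% VIP.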
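
The summary describes VIP only as the proportion of information successfully conveyed to listeners. A minimal sketch of one way such a proportion could be computed is shown below, assuming each speech segment receives a single yes/no judgment from a human evaluator. The `SegmentJudgment` type, the `valid_information_proportion` function, and the sample judgments are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class SegmentJudgment:
    """One human rating for one speech segment (hypothetical structure)."""
    segment_id: int
    conveyed: bool  # True if the evaluator judged the meaning as conveyed


def valid_information_proportion(judgments):
    """Return the fraction of judged segments marked as conveyed."""
    if not judgments:
        raise ValueError("at least one judged segment is required")
    valid = sum(1 for j in judgments if j.conveyed)
    return valid / len(judgments)


# Made-up ratings for four segments, for illustration only.
judgments = [
    SegmentJudgment(i, ok) for i, ok in enumerate([True, True, False, True])
]
print(f"VIP = {valid_information_proportion(judgments):.1%}")  # VIP = 75.0%
```

Under this assumption, a VIP of 81.3% would mean that roughly four out of five judged segments were rated as conveying the speaker's meaning.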

Abstract (original)

In this paper, we present Cross Language Agent — Simultaneous Interpretation, CLASI, a high-quality and human-like Simultaneous Speech Translation (SiST) System. Inspired by professional human interpreters, we utilize a novel data-driven read-write strategy to balance the translation quality and latency. To address the challenge of translating in-domain terminologies, CLASI employs a multi-modal retrieving module to obtain relevant information to augment the translation. Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of information that can be successfully conveyed to the listeners. In the real-world scenarios, where the speeches are often disfluent, informal, and unclear, CLASI achieves VIP of 81.3% and 78.0% for Chinese-to-English and English-to-Chinese translation directions, respectively. In contrast, state-of-the-art commercial or open-source systems only achieve 35.4% and 41.6%. On the extremely hard dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70% VIP.

arXiv information

Authors	Shanbo Cheng, Zhichao Huang, Tom Ko, Hang Li, Ningxin Peng, Lu Xu, Qini Zhang
Published	2024-07-31 14:48:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
