Tuning Large language model for End-to-end Speech Translation

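Summary

With the emergence of large language models (LLMs), multimodal models built on LLMs have shown significant potential. Models such as LLaSM, X-LLM, and SpeechGPT show an impressive ability to comprehend and generate human instructions. However, their performance often degrades on complex tasks such as end-to-end speech translation (E2E-ST), a translation task that crosses both languages and modalities. Compared with single-modal models, multimodal models lag behind in such scenarios. This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and an LLM backend. Training proceeds in two stages: (1) modality adjustment, in which the adapter is tuned to align speech representations with the text embedding space, and (2) downstream task fine-tuning, in which both the adapter and the LLM are trained to optimize performance on the E2E-ST task. Experiments on the MuST-C speech translation benchmark show that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on the En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state of the art. We also provide a detailed analysis of single-modal model selection and of the impact of training strategies, laying a foundation for future research. We plan to release our code and models after review.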

Summary (original)

With the emergence of large language models (LLMs), multimodal models based on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM, and SpeechGPT exhibit an impressive ability to comprehend and generate human instructions. However, their performance often falters when faced with complex tasks like end-to-end speech translation (E2E-ST), a cross-language and cross-modal translation task. In comparison to single-modal models, multimodal models lag behind in these scenarios. This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and an LLM backend. The training of LST consists of two stages: (1) Modality adjustment, where the adapter is tuned to align speech representation with text embedding space, and (2) Downstream task fine-tuning, where both the adapter and LLM model are trained to optimize performance on the E2E-ST task. Experimental results on the MuST-C speech translation benchmark demonstrate that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state-of-the-art. Additionally, we conduct an in-depth analysis of single-modal model selection and the impact of training strategies, which lays the foundation for future research. We will open up our code and models after review.
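
The two-stage schedule in the abstract can be pictured as a table of which components receive gradient updates in each stage. The following is a minimal sketch rather than the authors' implementation: the names `COMPONENTS`, `StagePlan`, `STAGES`, and `frozen_components` are hypothetical, and treating the speech frontend as frozen in both stages is an assumption, since the abstract only states which parts are trained.

```python
from dataclasses import dataclass

# The three components named in the abstract (identifiers are hypothetical).
COMPONENTS = ("speech_frontend", "adapter", "llm_backend")


@dataclass(frozen=True)
class StagePlan:
    """One training stage and the set of components updated during it."""
    name: str
    trainable: frozenset


# Stage 1 tunes only the adapter; stage 2 trains the adapter and the LLM.
# ASSUMPTION: the speech frontend is never listed as trained in the abstract,
# so this sketch keeps it frozen throughout.
STAGES = (
    StagePlan("modality_adjustment", frozenset({"adapter"})),
    StagePlan("downstream_task_fine_tuning", frozenset({"adapter", "llm_backend"})),
)


def frozen_components(stage: StagePlan) -> list:
    """Return the components that receive no updates in the given stage."""
    return [name for name in COMPONENTS if name not in stage.trainable]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "trains", sorted(stage.trainable),
              "and freezes", frozen_components(stage))
```

In a real training script, a plan like this would typically decide which parameter groups are handed to the optimizer at each stage.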

arXiv information

Authors	Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Xiaolin Jiao
Published	2023-10-03 13:43:50+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
