VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

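Summary

Recent research shows that model architectures, training objectives, and inference methods are converging across tasks and modalities.
This paper proposes VioLA, a single autoregressive, decoder-only Transformer network that unifies cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech, by treating them all as one conditional codec language modeling task trained in a multi-task learning framework.
To achieve this, all speech utterances are first converted into discrete tokens (similar to text data) using an offline neural codec encoder.
All of these tasks thereby become token-based sequence conversion problems that a single conditional language model can handle naturally.
Task IDs (TID) and language IDs (LID) are further integrated into the model to strengthen its ability to handle different languages and tasks.
Experimental results show that VioLA supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves performance comparable to, and in some cases better than, strong baselines.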
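
The idea of casting every task as a token sequence for one conditional language model can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the task and language lists, the vocabulary sizes, the ordering of task ID, language ID, source, and target within a sequence, and the helper names `text_ids`, `codec_ids`, and `build_sequence`. It also treats the codec output as a single stream of codes, which is a simplification.

```python
from typing import List, Tuple

# Hypothetical shared vocabulary layout (an assumption, not the paper's):
# [task IDs][language IDs][text tokens][codec tokens]
TASKS = ["asr", "mt", "tts", "s2st"]
LANGS = ["en", "zh"]
TEXT_VOCAB_SIZE = 1000   # assumed text vocabulary size
CODEC_VOCAB_SIZE = 1024  # assumed codec codebook size

TASK_ID = {task: i for i, task in enumerate(TASKS)}
LANG_ID = {lang: len(TASKS) + i for i, lang in enumerate(LANGS)}
TEXT_OFFSET = len(TASKS) + len(LANGS)
CODEC_OFFSET = TEXT_OFFSET + TEXT_VOCAB_SIZE
VOCAB_SIZE = CODEC_OFFSET + CODEC_VOCAB_SIZE


def text_ids(tokens: List[int]) -> List[int]:
    """Shift raw text token indices into the shared vocabulary."""
    return [TEXT_OFFSET + t for t in tokens]


def codec_ids(codes: List[int]) -> List[int]:
    """Shift raw codec codes (from an offline codec encoder) into the shared vocabulary."""
    return [CODEC_OFFSET + c for c in codes]


def build_sequence(task: str, src_lang: str, tgt_lang: str,
                   source: List[int], target: List[int]) -> Tuple[List[int], int]:
    """Return one flat token sequence and the length of its conditioning prefix.

    The prefix (task ID, source language ID, source tokens, target language ID)
    is what the model conditions on; the remaining tokens are what it would be
    trained to predict autoregressively.
    """
    prefix = [TASK_ID[task], LANG_ID[src_lang]] + source + [LANG_ID[tgt_lang]]
    return prefix + target, len(prefix)


# Speech recognition: codec tokens in, text tokens out.
asr_seq, asr_prefix_len = build_sequence(
    "asr", "en", "en", codec_ids([5, 17, 900]), text_ids([42, 7]))

# Speech synthesis: text tokens in, codec tokens out.
tts_seq, tts_prefix_len = build_sequence(
    "tts", "en", "en", text_ids([42, 7]), codec_ids([5]))
```

Under these assumptions, recognition and synthesis differ only in which side of the sequence holds codec tokens and which task ID leads it, so the same decoder and the same next-token objective can serve both.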

Abstract (original)

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language model task via multi-task learning framework. To accomplish this, we first convert all the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In such a way, all these tasks are converted to token-based sequence conversion problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks. Experimental results demonstrate that the proposed VioLA model can support both single-modal and cross-modal tasks well, and the decoder-only model achieves a comparable and even better performance than the strong baselines.

arXiv information

Authors	Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei
Published	2023-05-25 14:39:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
