AudioPaLM: A Large Language Model That Can Speak and Listen

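Summary

We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses a text-based and a speech-based language model, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate both text and speech, with applications including speech recognition and speech-to-speech translation.
From AudioLM, AudioPaLM inherits the ability to preserve paralinguistic information such as speaker identity and intonation; from text large language models such as PaLM-2, it inherits linguistic knowledge that is present only in such models.
We show that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with speech tasks.
The resulting model significantly outperforms existing systems on speech translation tasks and can perform zero-shot speech-to-text translation for many languages whose input/target language combinations were not seen in training.
AudioPaLM also exhibits features of audio language models, such as transferring a voice across languages based on a short spoken prompt.
Examples of the method are released at https://google-research.github.io/seanet/audiopalm/examples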
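
The summary does not spell out how a text-only model is turned into one that also handles speech. As a minimal sketch of one plausible mechanism (an assumption for illustration, not a detail stated in the abstract), the pretrained text embedding table can be kept unchanged and extended with new rows for discrete audio tokens, so that text and audio share a single token sequence. All names, vocabulary sizes, and the initialization scale below are hypothetical.

```python
import random


def extend_embeddings(text_embeddings, num_audio_tokens, dim, seed=0):
    """Keep the pretrained text rows and append freshly initialized audio rows.

    Assumption: audio rows get small random values; the abstract does not
    say how new parameters are initialized.
    """
    rng = random.Random(seed)
    audio_rows = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(num_audio_tokens)
    ]
    # Copy the text rows so the original table is left untouched.
    return [row[:] for row in text_embeddings] + audio_rows


def audio_to_joint_ids(audio_tokens, text_vocab_size):
    """Shift audio token ids so they follow the text ids in the joint vocabulary."""
    return [text_vocab_size + t for t in audio_tokens]


# Toy sizes, chosen only for illustration.
TEXT_VOCAB = 8
AUDIO_VOCAB = 4
DIM = 3

# Stand-in for pretrained text embeddings.
text_table = [[float(i)] * DIM for i in range(TEXT_VOCAB)]
joint_table = extend_embeddings(text_table, AUDIO_VOCAB, DIM)

# A mixed sequence: two text tokens followed by three audio tokens.
prompt = [1, 5] + audio_to_joint_ids([0, 3, 2], TEXT_VOCAB)
vectors = [joint_table[i] for i in prompt]
```

In this sketch the first `TEXT_VOCAB` rows of `joint_table` are exactly the pretrained text rows, which is the sense in which text-only weights are reused; only the appended audio rows start from scratch.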

Summary (original)

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

arXiv information

Authors	Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
Published	2023-06-22 14:37:54+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
