SpeechVerse: A Large-scale Generalizable Audio Language Model

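Summary

Large language models (LLMs) have shown remarkable proficiency at tasks that require semantic understanding of natural-language instructions.
Recently, many studies have extended this capability to multimodal audio and text inputs, but the resulting models are often limited to specific fine-tuned tasks such as automatic speech recognition and translation.
We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models through a small set of learnable parameters while keeping the pre-trained models frozen during training.
The models are instruction-finetuned on continuous latent representations extracted from the speech foundation model, with the aim of optimal zero-shot performance across a diverse range of speech processing tasks specified by natural-language instructions.
We run extensive benchmarks, comparing our model's performance against traditional baselines across several datasets and tasks.
We also evaluate the model's capacity for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks.
Our experiments show that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.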
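
The idea of joining frozen speech and text models through a small trainable component can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the abstract does not describe the adapter's design, so the frame averaging, the single linear projection, the dimensions, the ordering of audio and text vectors, and all function names below are hypothetical stand-ins, not the authors' implementation.

```python
import random

random.seed(0)

SPEECH_DIM = 8  # hypothetical width of the frozen speech encoder's output
LLM_DIM = 4     # hypothetical width of the frozen LLM's token embeddings
STRIDE = 2      # hypothetical frame-reduction factor inside the adapter


def frozen_speech_encoder(num_frames):
    """Stand-in for a pre-trained speech model: returns continuous features."""
    return [[random.uniform(-1, 1) for _ in range(SPEECH_DIM)]
            for _ in range(num_frames)]


def frozen_text_embedding(tokens):
    """Stand-in for the LLM's frozen embedding table (toy values)."""
    return [[float(len(tok))] * LLM_DIM for tok in tokens]


class Adapter:
    """The only trainable component in this sketch (assumed design)."""

    def __init__(self):
        self.weight = [[random.uniform(-0.1, 0.1) for _ in range(SPEECH_DIM)]
                       for _ in range(LLM_DIM)]

    def num_parameters(self):
        return LLM_DIM * SPEECH_DIM

    def __call__(self, frames):
        out = []
        # Average every STRIDE consecutive frames to shorten the sequence.
        for start in range(0, len(frames) - STRIDE + 1, STRIDE):
            group = frames[start:start + STRIDE]
            pooled = [sum(col) / STRIDE for col in zip(*group)]
            # Project the pooled frame into the LLM embedding space.
            out.append([sum(w * x for w, x in zip(row, pooled))
                        for row in self.weight])
        return out


def build_llm_input(instruction_tokens, num_audio_frames, adapter):
    """Concatenate adapted audio vectors with instruction embeddings."""
    audio = adapter(frozen_speech_encoder(num_audio_frames))
    text = frozen_text_embedding(instruction_tokens)
    return audio + text  # audio-first ordering is an assumption


adapter = Adapter()
sequence = build_llm_input(["transcribe", "the", "audio"], 6, adapter)
print(len(sequence), len(sequence[0]), adapter.num_parameters())  # 6 4 32
```

In this toy setup only the 32 adapter weights would be updated during training; the two stand-in encoders have no trainable state, which mirrors the frozen pre-trained models mentioned in the summary.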

Abstract (original)

Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model’s capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.

arXiv information

Authors	Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, David Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff
Published	2024-05-14 03:33:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
