BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

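Summary

Incorporating speech understanding into pretrained large language models (SpeechLLM) has become an important research direction.
Previous architectures fall into two categories: i) GPT-style, which prepends speech prompts to the text prompts as a single LLM input sequence, as in a decoder-only model;
ii) T5-style, which introduces speech cross-attention into each layer of the pretrained LLM.
We propose the BESTOW architecture, which brings the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities.
Moreover, neither style has a clear streaming solution, especially one that generalizes to speech multitask.
We reformulate streamable SpeechLLM as a read-write policy problem and unify offline and streaming research under the BESTOW architecture.
We thereby demonstrate the first open-source SpeechLLM solution that enables streaming and multitask at scale (beyond ASR) at the same time.
This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, and unseen DynamicSuperb tasks).
It is end-to-end optimizable, has lower training and inference cost, and demonstrates that LLM knowledge transfers to speech.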
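
To make the efficiency contrast between the two prior styles concrete, here is a rough sketch that counts attention-score entries per layer. It is not taken from the paper: the function names are hypothetical, masking and the speech encoder's own cost are ignored, and the T5-style count assumes the speech features are only attended to by the text tokens.

```python
def gpt_style_scores(speech_len: int, text_len: int) -> int:
    """Score entries per layer when speech is prepended to text.

    Assumption: one self-attention over the concatenated sequence,
    with no causal mask taken into account.
    """
    total = speech_len + text_len
    return total * total


def t5_style_scores(speech_len: int, text_len: int) -> int:
    """Score entries per layer with text self-attention plus speech cross-attention.

    Assumption: text tokens attend to each other and to the speech
    features; speech features do not attend to anything here.
    """
    text_self = text_len * text_len
    cross = text_len * speech_len
    return text_self + cross


if __name__ == "__main__":
    speech_len, text_len = 500, 50  # illustrative lengths, not from the paper
    print(gpt_style_scores(speech_len, text_len))  # 302500
    print(t5_style_scores(speech_len, text_len))   # 27500
```

Under these simplified assumptions, the prepend approach grows quadratically with the number of speech frames, while the cross-attention approach grows linearly with it. Actual costs depend on masking, caching, and the model details.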

Summary (original)

Incorporating speech understanding capabilities into pretrained large language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unify the offline and streaming research with the BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables Streaming and Multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech.
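
The abstract frames streaming as a read-write policy problem: at each step the model either reads more speech or writes an output token. The abstract does not say which policy BESTOW uses, so the sketch below uses wait-k, a fixed policy known from simultaneous translation, purely as an illustration. The function name and the notion of "chunks" are assumptions.

```python
from typing import List


def wait_k_schedule(num_chunks: int, num_tokens: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a wait-k policy.

    The policy stays k speech chunks ahead of the tokens written so far,
    and only writes once the input is exhausted.
    """
    actions: List[str] = []
    read, written = 0, 0
    while written < num_tokens:
        # Read while we are fewer than k chunks ahead and input remains.
        if read < min(written + k, num_chunks):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


if __name__ == "__main__":
    # 3 speech chunks, 4 output tokens, lag of 2 chunks (illustrative values)
    print(wait_k_schedule(num_chunks=3, num_tokens=4, k=2))
```

A larger k trades latency for more acoustic context before each written token. An offline model corresponds to reading every chunk before the first write.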

arXiv information

Authors	Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg
Published	2024-06-28 14:40:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
