Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

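Abstract

We present a novel integration of an instruction-tuned large language model (LLM) with end-to-end automatic speech recognition (ASR).
Modern LLMs can perform a wide range of linguistic tasks in a zero-shot setting when given a precise instruction or prompt that steers text generation toward the desired task.
We explore using this zero-shot capability of LLMs to extract linguistic information that can help improve ASR performance.
Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use its embedded linguistic knowledge to perform end-to-end ASR.
The proposed model is built on the hybrid connectionist temporal classification (CTC) and attention architecture, with an instruction-tuned LLM (i.e., Llama2) employed as the front-end of the decoder.
The ASR hypothesis to be corrected is obtained from the encoder via CTC decoding and is fed into the LLM together with an instruction.
The decoder then takes the LLM embeddings as input and performs sequence generation, incorporating acoustic information from the encoder output.
Experimental results and analyses show that the proposed integration yields promising performance improvements and that our approach benefits substantially from LLM-based rescoring.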

Abstract (original)

We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can perform a wide range of linguistic tasks within zero-shot learning when provided with a precise instruction or a prompt to guide the text generation process towards the desired task. We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance. Specifically, we direct an LLM to correct grammatical errors in an ASR hypothesis and harness the embedded linguistic knowledge to conduct end-to-end ASR. The proposed model is built on the hybrid connectionist temporal classification (CTC) and attention architecture, where an instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder. An ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding, which is then fed into the LLM along with an instruction. The decoder subsequently takes as input the LLM embeddings to perform sequence generation, incorporating acoustic information from the encoder output. Experimental results and analyses demonstrate that the proposed integration yields promising performance improvements, and our approach largely benefits from LLM-based rescoring.
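
The first two steps described in the abstract (obtaining a hypothesis via CTC decoding, then pairing it with an instruction for the LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vocabulary, the per-frame scores, the blank index, and the instruction wording in `build_prompt` are all assumptions made for illustration, and best-path (greedy) decoding is only one common way to decode CTC outputs. The LLM call and the attention decoder are omitted.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring symbol index for every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # Merge consecutive repeats, then remove blank symbols.
    collapsed = [idx for idx, _ in groupby(best_path)]
    return "".join(vocab[idx] for idx in collapsed if idx != blank)


def build_prompt(hypothesis):
    """Attach an instruction to the hypothesis.

    Hypothetical wording: the abstract does not give the actual prompt.
    """
    return (
        "Correct the grammatical errors in the following speech recognition "
        f"hypothesis.\nHypothesis: {hypothesis}\nCorrected:"
    )


# Toy vocabulary and per-frame scores, made up for illustration.
vocab = ["<blank>", "c", "a", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.70, 0.10, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.10, 0.10, 0.10, 0.70],  # t (repeat, merged)
]

hypothesis = ctc_greedy_decode(frame_scores, vocab)
prompt = build_prompt(hypothesis)
print(prompt)
```

In the described model, the resulting prompt would be passed through the instruction-tuned LLM, and the LLM's embeddings (rather than its generated text) would feed the decoder alongside the encoder output.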

arXiv information

Authors	Yosuke Higuchi,Tetsuji Ogawa,Tetsunori Kobayashi
Published	2023-09-19 11:10:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google