Speech Retrieval-Augmented Generation without Automatic Speech Recognition

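Summary

A common approach to question answering over speech data is to first transcribe the speech with automatic speech recognition (ASR) and then run text-based retrieval-augmented generation (RAG) over the transcripts.
This cascaded pipeline has proven effective in many practical settings, but ASR errors can propagate into the retrieval and generation steps.
To overcome this limitation, we introduce SpeechRAG, a new framework for open question answering over spoken data.
Our approach fine-tunes a pre-trained speech encoder into a speech adapter whose output is fed into a frozen retrieval model based on a large language model (LLM).
By aligning the text and speech embedding spaces, the speech retriever retrieves audio passages directly from text queries, drawing on the retrieval capacity of the frozen text retriever.
Retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade relative to the text-based baseline and outperforms cascaded systems that use ASR.
For generation, we use a speech language model (SLM) as the generator, conditioned on audio passages rather than transcripts.
Even without fine-tuning the SLM, this approach outperforms cascaded text-based models when the word error rate (WER) of the transcripts is high.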
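
The retrieval step described above can be illustrated with a minimal sketch. It assumes that text queries and audio passages have already been mapped into one shared embedding space, and it ranks passages by cosine similarity. The similarity measure, the function names, and the toy vectors are all assumptions made for illustration; the abstract does not specify the scoring function or the encoders' interfaces.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, passage_embeddings, top_k=1):
    """Return the ids of the top_k passages most similar to the query.

    passage_embeddings maps a passage id to its embedding. In a real
    system these would come from a speech encoder plus adapter; here
    they are hand-made toy vectors (an assumption for illustration).
    """
    scored = [
        (cosine(query_embedding, emb), pid)
        for pid, emb in passage_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:top_k]]


# Hypothetical embeddings of three audio passages (no transcripts involved).
audio_passages = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.8, 0.3],
    "clip_c": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query in the same space.
text_query = [0.2, 0.9, 0.2]

print(retrieve(text_query, audio_passages, top_k=2))
```

Because the query and the passages live in the same space, no ASR transcript is needed at retrieval time; only the passage embeddings are indexed.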

Abstract (original)

One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)–based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade over the text-based baseline, and outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded text-based models when there is high WER in the transcripts.
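
The abstract's final claim depends on the word error rate (WER) of the transcripts. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference and an ASR hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the paper's exact text normalization is not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Words are obtained by splitting on whitespace (an assumption; real
    evaluations usually normalize case and punctuation first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("cat" -> "bat") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the bat sat on mat"))
```

A higher WER means more corrupted words reach a text-based retriever and generator, which is the setting where the abstract reports that conditioning on audio directly helps.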

arXiv information

Authors	Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, Kyu Han
Published	2025-01-02 07:29:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
