Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

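Abstract

Interactions with virtual assistants typically begin with a trigger phrase, followed by a command.
In this work, we explore whether these interactions can be made more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant, based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder, and using them as input features to a large language model (LLM).
In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios where only a single frozen LLM is available on the device.
For this reason, our model is trained on 80,000 or fewer examples of multimodal data, using a combination of low-rank adaptation and prefix tuning.
We compare the proposed system with unimodal baselines and show that the multimodal approach achieves lower equal-error-rates (EERs) while using only a fraction of the training data.
We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.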
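The abstract names low-rank adaptation and prefix tuning as the way a frozen LLM is adapted. The sketch below illustrates the two ideas in general terms only; it is not the authors' implementation. All function names, shapes, and the `alpha` scaling parameter are assumptions introduced for illustration, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Apply a frozen weight plus a trainable low-rank update.

    Assumed shapes (illustrative): x is (n, d_in), w_frozen is (d_in, d_out),
    lora_a is (d_in, r), lora_b is (r, d_out). Only lora_a and lora_b would
    be trained; w_frozen stays fixed.
    """
    rank = len(lora_b)
    scale = alpha / rank
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def prepend_prefix(prefix_vectors, token_embeddings):
    """Prefix tuning, reduced to its core: trainable vectors are placed
    in front of the token embeddings before they enter the frozen model."""
    return prefix_vectors + token_embeddings


if __name__ == "__main__":
    x = [[1.0, 2.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0], [1.0]]
    # With lora_b set to zero, the output matches the frozen model.
    print(lora_forward(x, identity, a, [[0.0, 0.0]], alpha=2.0))
    print(lora_forward(x, identity, a, [[1.0, -1.0]], alpha=2.0))
    print(len(prepend_prefix([[0.5, 0.5]], x)))
```

Because the low-rank matrices hold far fewer parameters than the frozen weight, a setup like this keeps a single shared LLM untouched while training only a small task-specific add-on.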

Abstract (original)

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data and resource efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or less examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal-error-rates (EERs), while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
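The results above are reported as equal-error-rates. As a rough illustration of the metric, the sketch below finds the threshold at which the false-accept rate and the false-reject rate are closest and reports their mean. The function name, the label convention (1 = device-directed), and the threshold sweep are assumptions for illustration; the paper's evaluation code may differ.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER for binary labels (1 = device-directed, 0 = not).

    A higher score is assumed to mean "more likely device-directed".
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus +inf (reject all).
    for threshold in sorted(set(scores)) + [float("inf")]:
        # False accepts: non-directed speech scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejects: directed speech scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
    labels = [1, 1, 0, 1, 0, 0]
    print(equal_error_rate(scores, labels))
```

A lower EER means the detector can reject more non-directed speech without rejecting more speech that was actually addressed to the assistant.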

arXiv information

Authors	Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
Published	2023-12-06 17:29:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
