Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context

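Summary

Large language models (LLMs) have recently shown a remarkable ability to process not only text but also multimodal inputs such as speech and audio.
However, most existing models focus on analyzing input signals under text instructions, overlooking scenarios in which a spoken instruction and audio are mixed together and fed to the model as a single input.
To address this gap, the authors introduce Solla, a new framework designed to understand speech-based questions while simultaneously hearing the acoustic context.
Solla incorporates an audio tagging module that identifies and represents audio events, together with an ASR-assisted prediction method that improves comprehension of the spoken content.
To rigorously evaluate Solla and other publicly available models, the authors propose a new benchmark dataset called SA-Eval, which covers three tasks: audio event classification, audio captioning, and audio question answering.
SA-Eval contains diverse speech instructions in a variety of speaking styles and spans two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions.
Experimental results show that Solla performs on par with or better than baseline models on both the easy and hard test sets, underscoring its effectiveness at jointly understanding speech and audio.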

Summary (original)

Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval has diverse speech instruction with various speaking styles, encompassing two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.

arXiv information

Authors	Junyi Ao, Dekun Chen, Xiaohai Tian, Wenjie Feng, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
Published	2025-03-19 15:34:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
