SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Summary

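Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and similar inputs. Their performance on speech and audio processing tasks has been studied extensively, but their reasoning abilities remain underexplored.
In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, has not been evaluated systematically.
Existing benchmarks focus on general speech and audio processing tasks, conversational abilities, and fairness, but overlook this aspect.
To fill this gap, the authors introduce SAKURA, a benchmark that evaluates the multi-hop reasoning of LALMs based on speech and audio information.
The results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning even when they extract the relevant information correctly, which highlights a fundamental challenge in multimodal reasoning.
The findings expose a critical limitation of LALMs and offer insights and resources for future research.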

Abstract (original)

Large audio-language models (LALMs) extend the large language models with multimodal understanding in speech, audio, etc. While their performances on speech and audio-processing tasks are extensively studied, their reasoning abilities remain underexplored. Particularly, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs’ multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.

arXiv information

Authors	Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
Published	2025-05-19 15:20:32+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
