BAT: Learning to Reason about Spatial Sounds with Large Language Models

Abstract

Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper, we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed SpatialSoundQA, a spatial sound-based question-answering dataset, offering a range of QA tasks that train BAT in various aspects of spatial sound perception and reasoning. The acoustic front-end encoder of BAT is a novel spatial audio encoder named Spatial Audio Spectrogram Transformer, or Spatial-AST, which by itself achieves strong performance across sound event detection, spatial localization, and distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment. Our experiments demonstrate BAT’s superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.
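
The abstract does not spell out how the binaural dataset is synthesized. A common way to spatialize a mono recording, and the one assumed in this sketch, is to convolve it with a left-ear and a right-ear room impulse response (RIR) produced by an acoustic simulator. The function names and the toy impulse responses below are hypothetical and are not taken from the paper or from the SoundSpaces 2.0 API.

```python
# Minimal sketch (assumption): binaural synthesis by convolving a mono clip
# with one impulse response per ear. Pure Python for illustration only.

def convolve(signal, kernel):
    """Full linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def synthesize_binaural(mono, rir_left, rir_right):
    """Return (left, right) channels for a mono clip and a pair of RIRs."""
    return convolve(mono, rir_left), convolve(mono, rir_right)


# Toy data: the right ear hears the source 2 samples later and at half level,
# which mimics a source located to the listener's left.
mono_clip = [1.0, -0.5, 0.25, 0.0]
rir_left = [1.0, 0.0, 0.0]
rir_right = [0.0, 0.0, 0.5]

left, right = synthesize_binaural(mono_clip, rir_left, rir_right)
print(left)
print(right)
```

The delay and level difference between the two output channels are the kind of interaural cues a spatial audio encoder can use to estimate direction and distance. For real audio, an FFT-based routine such as `scipy.signal.fftconvolve` would be a more practical choice than the nested loop.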

arXiv information

Authors	Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
Published	2024-02-02 17:34:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
