Audio Dialogues: Dialogues dataset for audio and music understanding

Abstract

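Existing datasets for audio understanding focus primarily on single-turn interactions (e.g., audio captioning and audio question answering) that describe audio in natural language, which limits audio understanding through interactive dialogue.
To address this gap, we introduce Audio Dialogues, a multi-turn dialogue dataset containing 163.8k samples of general audio sounds and music.
In addition to dialogues, Audio Dialogues includes question-answer pairs for understanding and comparing multiple input audio clips.
Audio Dialogues uses a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues with a large language model (LLM).
We evaluate existing audio-augmented large language models on the proposed dataset to demonstrate the complexity and applicability of Audio Dialogues.
The code for generating the dataset will be made publicly available.
Detailed prompts and generated dialogues are available on the demo website https://audiodialogues.github.io/.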

Abstract (original)

Existing datasets for audio understanding primarily focus on single-turn interactions (i.e. audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dialogues, Audio Dialogues also has question-answer pairs to understand and compare multiple input audios together. Audio Dialogues leverages a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues using a Large Language Model (LLM). We evaluate existing audio-augmented large language models on our proposed dataset to demonstrate the complexity and applicability of Audio Dialogues. Our code for generating the dataset will be made publicly available. Detailed prompts and generated dialogues can be found on the demo website https://audiodialogues.github.io/.
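
The abstract describes generating multi-turn dialogues by prompting an LLM with caption annotations. The following is a minimal sketch of what such a pipeline could look like, not the authors' implementation: the prompt wording, the `Turn` and `DialogueSample` classes, the `User:`/`Assistant:` output format, and the function names are all hypothetical. The actual prompts are documented on the demo website.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class DialogueSample:
    audio_id: str  # hypothetical identifier for the source audio clip
    turns: list[Turn] = field(default_factory=list)


def build_generation_prompt(captions: list[str], num_turns: int = 3) -> str:
    """Assemble a text prompt asking an LLM for a caption-grounded dialogue.

    The wording is an illustrative assumption, not the paper's prompt.
    """
    caption_block = "\n".join(f"- {c}" for c in captions)
    return (
        "The following captions describe one audio clip:\n"
        f"{caption_block}\n"
        f"Write a {num_turns}-turn dialogue between a user and an assistant "
        "about this audio, using only information in the captions."
    )


def parse_dialogue(audio_id: str, raw: str) -> DialogueSample:
    """Parse 'User: ...' / 'Assistant: ...' lines into a DialogueSample.

    Lines that do not start with one of the two role labels are skipped.
    """
    sample = DialogueSample(audio_id=audio_id)
    for line in raw.splitlines():
        role, sep, text = line.partition(":")
        if sep and role.strip().lower() in {"user", "assistant"}:
            sample.turns.append(Turn(role.strip().lower(), text.strip()))
    return sample
```

In this sketch, the string returned by `build_generation_prompt` would be sent to whichever LLM is used, and the model's text reply would be passed to `parse_dialogue`. The LLM call itself is left out because the abstract does not specify it.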

arXiv information

Authors	Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro
Published	2024-04-11 10:08:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
