WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Summary

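Audio-language (AL) multimodal learning tasks have advanced significantly in recent years.
However, existing audio-language datasets are limited in size, and collecting them is costly and time-consuming, which poses a challenge for researchers.
To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400,000 audio clips with paired captions.
The audio clips and their raw descriptions were sourced from web sources and a sound event detection dataset.
However, the raw descriptions harvested online are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
To overcome this, we propose a three-stage processing pipeline that filters noisy data and generates high-quality captions, using ChatGPT, a large language model, to filter and transform the raw descriptions automatically.
We comprehensively analyse the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks.
Systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin.
We hope the proposed WavCaps dataset will facilitate research in audio-language multimodal learning and demonstrate the potential of using ChatGPT to enhance academic research.
The dataset and code are available at https://github.com/XinhaoMei/WavCaps.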

Abstract (original)

The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the WavCaps dataset we have proposed to facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing ChatGPT to enhance academic research. Our dataset and codes are available at https://github.com/XinhaoMei/WavCaps.
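The abstract mentions a three-stage pipeline that filters noisy descriptions and turns them into captions, but it does not say what each stage does. The sketch below is only an illustration of that general filter, transform, filter shape. The stage boundaries, the word-count thresholds, and every function name here are assumptions, and `toy_rewrite` is a rule-based stand-in for the ChatGPT step, not the authors' prompt or method.

```python
import re
from typing import Callable, Iterable, List, Optional

MIN_WORDS = 2   # hypothetical threshold, not taken from the paper
MAX_WORDS = 30  # hypothetical threshold, not taken from the paper


def pre_filter(raw: str) -> bool:
    """Stage 1 (assumed): keep descriptions with enough words and at least one letter."""
    return len(raw.split()) >= MIN_WORDS and any(ch.isalpha() for ch in raw)


def toy_rewrite(raw: str) -> Optional[str]:
    """Stage 2 stand-in: a real pipeline would send `raw` to a language model here.

    This toy version only strips URLs and audio file names and normalises case.
    It returns None when nothing usable remains.
    """
    text = re.sub(r"https?://\S+", " ", raw)
    text = re.sub(r"\S+\.(?:wav|mp3|flac)\b", " ", text, flags=re.IGNORECASE)
    text = " ".join(text.split()).strip(" .,-")
    if not text:
        return None
    return text[0].upper() + text[1:].lower() + "."


def post_filter(caption: str) -> bool:
    """Stage 3 (assumed): drop captions that are too short or too long."""
    return MIN_WORDS <= len(caption.split()) <= MAX_WORDS


def build_captions(
    raw_descriptions: Iterable[str],
    rewrite: Callable[[str], Optional[str]] = toy_rewrite,
) -> List[str]:
    """Run the three assumed stages and return the surviving captions in order."""
    captions = []
    for raw in raw_descriptions:
        if not pre_filter(raw):
            continue
        caption = rewrite(raw)
        if caption is not None and post_filter(caption):
            captions.append(caption)
    return captions


if __name__ == "__main__":
    samples = [
        "dog_bark_01.wav Dog barking loudly in a park http://example.com/x",
        "12345",
        "Heavy RAIN falling on a tin roof",
    ]
    for caption in build_captions(samples):
        print(caption)
```

Passing the rewriting step in as the `rewrite` argument keeps the filtering logic testable offline; a caller could swap in a function that queries a language model without changing the rest of the sketch.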

arXiv information

Authors	Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang
Published	2023-03-30 14:07:47+00:00

Source and services used

arxiv.jp, Google
