Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning

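Summary

Recently, the AI community has made significant progress in developing powerful foundation models, driven by large-scale multimodal datasets.
For audio representation learning, however, existing datasets are limited in three respects: insufficient volume, simplistic content, and arduous collection procedures.
To build an audio dataset with high-quality captions, the authors propose an automatic approach that leverages multimodal inputs such as video frames and audio streams.
Specifically, they construct Auto-ACD, a large-scale, high-quality audio-language dataset comprising over 1.5 million audio-text pairs.
A series of pre-trained models or APIs is used to determine audio-visual synchronisation and to generate image captions, object detections, or audio tags for a given video.
An LLM is then used to paraphrase a matching caption for each audio clip, guided by the extracted multimodal clues.
To demonstrate the effectiveness of the proposed dataset, the authors train widely used models on it and show improved performance on various downstream tasks, including audio-language retrieval, audio captioning, and zero-shot classification.
In addition, they establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.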
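
The caption-generation step described above (collect per-video clues, then hand them to an LLM) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `VideoClues`, `build_caption_prompt`, the prompt wording, and the choice to skip clips that fail the synchronisation check are all assumptions made for the example, and no real LLM or tagging API is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoClues:
    """Hypothetical container for the clues extracted from one video clip."""
    image_caption: str = ""
    detected_objects: List[str] = field(default_factory=list)
    audio_tags: List[str] = field(default_factory=list)
    av_synchronised: bool = True  # outcome of the audio-visual sync check


def build_caption_prompt(clues: VideoClues) -> str:
    """Assemble a text prompt for an LLM from the extracted clues.

    Assumption: clips that fail the synchronisation check are skipped,
    which is signalled here by returning an empty string.
    """
    if not clues.av_synchronised:
        return ""
    lines = [
        "Write one sentence describing what can be heard in this clip.",
        "Use only the clues below and do not mention purely visual details.",
    ]
    # Only include the clue types that are actually available.
    if clues.image_caption:
        lines.append(f"Image caption: {clues.image_caption}")
    if clues.detected_objects:
        lines.append("Detected objects: " + ", ".join(clues.detected_objects))
    if clues.audio_tags:
        lines.append("Audio tags: " + ", ".join(clues.audio_tags))
    return "\n".join(lines)


if __name__ == "__main__":
    example = VideoClues(
        image_caption="a train at a station",
        detected_objects=["train"],
        audio_tags=["train horn", "speech"],
    )
    print(build_caption_prompt(example))
```

The returned string would then be sent to whichever LLM is used for paraphrasing. Keeping prompt assembly separate from the model call makes it easy to inspect which clues each caption was based on.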

Summary (original)

Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames, audio streams. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs, to determine audio-visual synchronisation, generate image captions, object detection, or audio tags for specific videos. Subsequently, we employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on our dataset and show performance improvement on various downstream tasks, for example, audio-language retrieval, audio captioning, zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.

arXiv information

Authors	Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie
Published	2024-09-09 14:52:15+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
