RECAP: Retrieval-Augmented Audio Captioning

Summary

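RECAP (REtrieval-Augmented Audio CAPtioning) is a novel and effective audio captioning system that generates a caption conditioned on both the input audio and captions of similar audio retrieved from a datastore.
Our proposed method can also transfer to any domain without additional fine-tuning.
To caption an audio sample, we use the audio-text model CLAP to retrieve similar captions from a replaceable datastore and build a prompt from them.
We then feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2, so that caption generation is conditioned on the audio.
Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings.
Because it can exploit a large datastore of text captions alone in a training-free fashion, RECAP can also caption novel audio events never seen during training and compositional audio containing multiple events.
To promote research in this area, we also release more than 150,000 new weakly labeled captions for AudioSet, AudioCaps, and Clotho.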
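
The retrieval and prompt-construction steps described above can be illustrated with a minimal sketch. It assumes that audio and caption embeddings live in a shared space (as CLAP provides) and that similarity is measured by cosine similarity. The toy three-dimensional vectors, the function names, and the prompt template are all hypothetical; the summary does not specify the actual template or retrieval settings.

```python
import math
from typing import List, Sequence, Tuple

# Toy datastore of (caption, embedding) pairs. Real embeddings would come from
# a CLAP text encoder; these 3-dimensional vectors are made up for illustration.
DATASTORE: List[Tuple[str, List[float]]] = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a tin roof", [0.0, 0.2, 0.9]),
    ("a puppy yelps and whines", [0.8, 0.3, 0.1]),
]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(
    audio_emb: Sequence[float],
    datastore: List[Tuple[str, List[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k captions whose embeddings are most similar to audio_emb."""
    ranked = sorted(
        datastore, key=lambda item: cosine(audio_emb, item[1]), reverse=True
    )
    return [caption for caption, _ in ranked[:k]]


def build_prompt(captions: List[str]) -> str:
    """Join retrieved captions into a text prompt (hypothetical template)."""
    retrieved = "; ".join(captions)
    return f"Similar audio sounds like: {retrieved}. This audio sounds like:"


if __name__ == "__main__":
    audio_emb = [1.0, 0.2, 0.0]  # stand-in for a CLAP audio embedding
    print(build_prompt(retrieve_captions(audio_emb, DATASTORE, k=2)))
```

Because the datastore is just a list of caption embeddings, swapping it for one from another domain requires no retraining, which is the property the summary refers to as "replaceable".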

Summary (Original)

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a training-free fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
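
The cross-attention conditioning mentioned in the abstract can be pictured with a small sketch of single-head scaled dot-product cross-attention, in which decoder token states act as queries and audio encoder states act as keys and values. This is a simplification under stated assumptions: it omits the learned query, key, and value projections and the multiple heads that a real GPT-2 cross-attention layer would have, and the function names and toy inputs are hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    text_states: List[List[float]], audio_states: List[List[float]]
) -> List[List[float]]:
    """For each text state, return an attention-weighted mix of audio states.

    Assumption: projections are the identity, so both inputs must share the
    same feature dimension.
    """
    d = len(audio_states[0])
    outputs = []
    for query in text_states:
        # Scaled dot-product score between this token and every audio frame.
        scores = [
            sum(q * a for q, a in zip(query, frame)) / math.sqrt(d)
            for frame in audio_states
        ]
        weights = softmax(scores)
        # Weighted sum of audio frames, one output vector per token.
        outputs.append(
            [sum(w * frame[j] for w, frame in zip(weights, audio_states)) for j in range(d)]
        )
    return outputs


if __name__ == "__main__":
    text = [[1.0, 0.0]]               # one decoder token state (toy values)
    audio = [[1.0, 0.0], [0.0, 1.0]]  # two audio frames (toy values)
    print(cross_attention(text, audio))
```

Each output row is a convex combination of the audio frames, so every generated token can draw on the audio representation in addition to the text prompt.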

arXiv Information

Authors	Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
Published	2023-09-18 14:53:08+00:00
arXiv site	arxiv_id(pdf)

Sources and Services Used

arxiv.jp, Google
