Text-to-feature diffusion for audio-visual few-shot learning

Summary

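Training deep learning models for video classification from audio-visual data generally requires large amounts of labeled training data, which are collected through a costly process.
Few-shot learning from video data is a challenging and underexplored setup, but a much cheaper one.
In particular, the inherently multi-modal nature of video data, which contains both audio and visual information, has rarely been exploited for few-shot video classification.
We therefore introduce a unified audio-visual few-shot video classification benchmark on three datasets (VGGSound-FSL, UCF-FSL, and ActivityNet-FSL), on which we adapt and compare ten methods.
We also propose AV-DIFF, a text-to-feature diffusion framework that first fuses temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes.
We show that AV-DIFF achieves state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning.
Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available.
Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.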

Abstract (original)

Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.
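The abstract says that AV-DIFF fuses audio and visual features via cross-modal attention before generating features for novel classes. The following is a minimal sketch of what such a fusion step could look like, not the authors' implementation: it uses plain scaled dot-product attention without learned projections, omits the diffusion stage entirely, and the names `cross_modal_attention` and `fuse` as well as the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Let each time step of one modality attend over the other modality.

    queries:     list of T_q feature vectors of dimension d
    keys_values: list of T_k feature vectors of dimension d, used as both
                 keys and values (assumption: a trained model would apply
                 learned linear projections first)
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


def fuse(audio, visual):
    """Hypothetical fusion: attend in both directions, add a residual
    connection, mean-pool over time, and concatenate the two modalities."""
    audio_ctx = cross_modal_attention(audio, visual)
    visual_ctx = cross_modal_attention(visual, audio)
    audio_out = [[x + c for x, c in zip(a, ctx)] for a, ctx in zip(audio, audio_ctx)]
    visual_out = [[x + c for x, c in zip(v, ctx)] for v, ctx in zip(visual, visual_ctx)]
    pooled_audio = [sum(col) / len(audio_out) for col in zip(*audio_out)]
    pooled_visual = [sum(col) / len(visual_out) for col in zip(*visual_out)]
    return pooled_audio + pooled_visual


# Toy example: 2 audio time steps and 3 visual time steps, feature dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(audio, visual)
print(len(fused))
```

The fused vector has twice the per-modality feature dimension because the pooled audio and visual representations are concatenated. Other fusion choices, such as summation or a learned projection, would be equally valid in a sketch like this.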

arXiv information

Authors	Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Published	2023-09-07 17:30:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
