STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment

Abstract

Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting, in which previously learned audio-video relations are forgotten. To tackle this problem, we propose a new continual audio-video pre-training method with two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder to determine the importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we propose to assess the correlation of the current patches with the past steps to identify the patches exhibiting high correlations with the past steps. Based on the results from the two ideas, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a relative performance gain of 3.69%p in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by ~45%.
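
The probabilistic patch selection step can be illustrated with a minimal sketch. The abstract does not say how the importance scores and the past-step correlations are combined, so everything below is an assumption made for illustration: the function names, the `penalty` parameter, the softmax, and the choice to down-weight patches that correlate strongly with past steps are all hypothetical and are not taken from the paper. The score values are made-up placeholders, not outputs of a real encoder.

```python
import math
import random


def selection_probabilities(importance, past_correlation, penalty=1.0):
    """Turn two per-patch scores into one sampling distribution.

    Assumption (not from the abstract): patches strongly correlated with
    past steps are down-weighted by subtracting `penalty * correlation`
    from the importance score before a softmax.
    """
    logits = [s - penalty * c for s, c in zip(importance, past_correlation)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_patches(probs, k, rng):
    """Sample k distinct patch indices, each draw proportional to probs.

    Assumes k <= len(probs).
    """
    remaining = list(range(len(probs)))
    chosen = []
    for _ in range(k):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # without replacement
    return sorted(chosen)


# Placeholder scores for six patches (hypothetical values).
importance = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2]
past_correlation = [0.1, 0.0, 0.9, 0.2, 0.1, 0.0]

probs = selection_probabilities(importance, past_correlation)
kept = sample_patches(probs, k=3, rng=random.Random(0))
print(kept)
```

Under these assumptions, patch 2 has a high importance score but is also highly correlated with past steps, so it ends up less likely to be selected than patch 0. Sampling instead of taking a hard top-k keeps the selection stochastic, which is one plausible reading of "probabilistic patch selection".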

arXiv information

Authors	Jaewoo Lee, Jaehong Yoon, Wonjae Kim, Yunji Kim, Sung Ju Hwang
Published	2024-02-02 18:31:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
