CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

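Summary

Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels.
Existing approaches address this by exploiting unimodal and cross-modal context.
The authors argue, however, that although cross-modal learning helps detect audible-visible events, in the weakly supervised setting it harms unaligned audible or visible events because it introduces information from an irrelevant modality.
The paper proposes CoLeaF, a new learning framework that optimizes how cross-modal context is integrated in the embedding space, so that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events.
In addition, videos often contain complex class relationships, and modelling them improves performance.
Doing so, however, adds computational cost to the network.
The framework is designed to exploit cross-class relationships during training without incurring additional computation at inference time.
The authors also propose new metrics to better evaluate a method's ability to perform AVVP.
Extensive experiments show that CoLeaF improves on state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.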
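
The results above are reported as F-scores. As a rough illustration of what that number measures, the sketch below computes a generic F-score over binary per-segment, per-class event labels. It is not the paper's evaluation code, and it does not reproduce the new metrics the authors propose, which this summary does not specify. The function name `f_score` and the toy labels are hypothetical.

```python
def f_score(pred, truth):
    """Generic F-score over binary labels.

    pred and truth are lists of segments; each segment is a list of
    0/1 flags, one per event class (a hypothetical layout).
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # event predicted and present
            elif p and not t:
                fp += 1      # event predicted but absent
            elif t and not p:
                fn += 1      # event present but missed
    denom = 2 * tp + fp + fn
    # With no predicted and no true events the score is undefined;
    # this sketch returns 0.0 in that case.
    return 2 * tp / denom if denom else 0.0


# Toy example: 4 one-second segments, 2 event classes.
truth = [[1, 0], [1, 0], [0, 1], [0, 0]]
pred = [[1, 0], [0, 0], [0, 1], [0, 1]]
score = f_score(pred, truth)  # 2 TP, 1 FP, 1 FN -> 4 / 6
```

In this toy case two events are found correctly, one is missed, and one is a false alarm, so the score is 2/3.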

Summary (original)

Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering them out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method’s capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.

arXiv information

Authors	Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton
Published	2024-05-20 09:50:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
