Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization

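Summary

Audio-visual event (AVE) localization has attracted considerable attention in recent years.
Most existing methods are limited to independently encoding and classifying each video segment in isolation from the full video (these can be regarded as segment-level representations of events).
However, they ignore the semantic consistency of the event within the same full video (which can be regarded as the video-level representation of the event).
In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task.
Specifically, we propose an event semantic consistency modeling (ESCM) module that explores video-level semantic information for semantic consistency modeling.
It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE).
CERE is proposed to obtain the semantic information of the event at the video level.
ISCE then takes the video-level event semantics as prior knowledge and guides the model to focus on the semantic continuity of the event within each modality.
In addition, we propose a new negative pair filter loss that encourages the network to filter out irrelevant segment pairs, and a new smooth loss that further widens the gap between different event categories in the weakly-supervised setting.
We perform extensive experiments on the public AVE dataset and outperform state-of-the-art methods in both the fully-supervised and weakly-supervised settings, which verifies the effectiveness of our method.
The code is available at https://github.com/Bravo5542/VSCG.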

Summary (original)

Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods are often limited to independently encoding and classifying each video segment separated from the full video (which can be regarded as the segment-level representations of events). However, they ignore the semantic consistency of the event within the same full video (which can be considered as the video-level representations of events). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module to explore video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE is proposed to obtain the event semantic information at the video level. Furthermore, ISCE takes video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative pair filter loss to encourage the network to filter out the irrelevant segment pairs and a new smooth loss to further increase the gap between different categories of events in the weakly-supervised setting. We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings, thus verifying the effectiveness of our method. The code is available at https://github.com/Bravo5542/VSCG.

arXiv information

Authors	Yuanyuan Jiang, Jianqin Yin, Yonghao Dang
Published	2023-10-20 08:48:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
