Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing

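Summary

Existing work on weakly-supervised audio-visual video parsing adopts the hybrid attention network (HAN) as the multi-modal embedding to capture cross-modal context.
HAN embeds the audio and visual modalities with a shared network and performs cross-attention at the input.
However, this early-fusion approach tightly entangles two modalities that are not fully correlated, which leads to sub-optimal performance when detecting single-modality events.
To address this problem, the authors propose a messenger-guided mid-fusion transformer that reduces uncorrelated cross-modal context in the fusion.
The messengers condense the full cross-modal context into a compact representation so that only useful cross-modal information is preserved.
In addition, because microphones capture audio events from all directions while cameras record visual events only within a restricted field of view, unaligned cross-modal context from audio occurs more frequently for visual event predictions.
The authors therefore propose cross-audio prediction consistency to suppress the influence of irrelevant audio information on visual event prediction.
Experiments consistently show that the framework outperforms existing state-of-the-art methods.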

Abstract (original)

Existing works on weakly-supervised audio-visual video parsing adopt hybrid attention network (HAN) as the multi-modal embedding to capture the cross-modal context. It embeds the audio and visual modalities with a shared network, where the cross-attention is performed at the input. However, such an early fusion method highly entangles the two non-fully correlated modalities and leads to sub-optimal performance in detecting single-modality events. To deal with this problem, we propose the messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion. The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information. Furthermore, due to the fact that microphones capture audio events from all directions, while cameras only record visual events within a restricted field of view, there is a more frequent occurrence of unaligned cross-modal context from audio for visual event predictions. We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction. Experiments consistently illustrate the superior performance of our framework compared to existing state-of-the-art methods.
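
The abstract describes the messengers only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that a small set of messenger vectors first attends over the tokens of one modality to condense them, and that the other modality then attends only to those condensed vectors. The function names (`attend`, `messenger_fusion`), the single-head attention without learned projections, the residual addition, and the token counts are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention; keys double as values.

    Illustrative only: no learned projection matrices are used.
    """
    dim = len(keys_values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return outputs


def messenger_fusion(target_tokens, source_tokens, messengers):
    """Fuse source-modality context into target tokens via a bottleneck.

    Assumed two-step scheme (hypothetical, not taken from the paper):
      1. messengers attend over all source tokens and condense them;
      2. target tokens attend only to the condensed messengers.
    """
    condensed = attend(messengers, source_tokens)   # len(messengers) vectors
    context = attend(target_tokens, condensed)      # one vector per target token
    # Residual addition keeps the target modality's own features.
    return [
        [t + c for t, c in zip(token, ctx)]
        for token, ctx in zip(target_tokens, context)
    ]


def random_tokens(count, dim, rng):
    """Random stand-ins for per-segment features (illustrative data only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
audio = random_tokens(10, 8, rng)       # 10 audio segments, 8-dim features
visual = random_tokens(10, 8, rng)      # 10 visual segments, 8-dim features
messengers = random_tokens(2, 8, rng)   # 2 messengers (learned in practice)

fused_visual = messenger_fusion(visual, audio, messengers)
print(len(fused_visual), len(fused_visual[0]))
```

In this sketch each visual token sees the audio stream only through two condensed vectors instead of all ten audio tokens, which is one way to picture how a compact representation could limit how much uncorrelated audio context reaches the visual branch.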

arXiv information

Authors	Yating Xu, Conghui Hu, Gim Hee Lee
Published	2023-11-14 13:27:03+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
