CAD — Contextual Multi-modal Alignment for Dynamic AVQA

Summary

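In Audio-Visual Question Answering (AVQA), the audio and visual modalities can be learned at three levels: 1) spatial, 2) temporal, and 3) semantic.
Existing AVQA methods have two major shortcomings.
First, the audio-visual (AV) information passing through the network is not aligned at the spatial and temporal levels.
Second, inter-modal (audio and visual) semantic information is often not balanced within a context.
Both lead to poor performance.
In this paper, we propose a new end-to-end Contextual Multi-modal Alignment (CAD) network that addresses these shortcomings by:
i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment at the spatial level;
ii) proposing a pre-training technique for dynamic audio and visual alignment at the temporal level in a self-supervised setting; and
iii) introducing a cross-attention mechanism that balances audio and visual information at the semantic level.
On the MUSIC-AVQA dataset, the proposed CAD network improves overall performance over state-of-the-art methods by 9.4% on average.
We also show that these contributions can be added to existing AVQA methods to improve their performance without additional complexity.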
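
The summary names a cross-attention mechanism for balancing audio and visual information but gives no details of its design. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in plain Python, applied in both directions (audio attending to visual, and visual attending to audio). It is not the authors' implementation: the function names, the toy feature values, and the omission of learned query/key/value projections are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows.

    Learned projection matrices are omitted here (an assumption for brevity).
    """
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value rows.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


# Hypothetical toy features: 2 audio segments and 3 visual segments, 4 dims each.
audio = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.0]]
visual = [[0.8, 0.1, 0.4, 0.3], [0.0, 1.0, 0.2, 0.1], [0.5, 0.5, 0.5, 0.5]]

# Audio queries attend over visual keys/values, and vice versa.
audio_attended = cross_attention(audio, visual, visual)
visual_attended = cross_attention(visual, audio, audio)
```

Each output row is a weighted average of the other modality's features, so `audio_attended` has one row per audio segment and `visual_attended` has one row per visual segment. A practical model would typically add learned projections and multiple heads on top of this core computation.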

Summary (original)

In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn’t aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.

arXiv information

Authors	Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa
Published	2023-10-25 16:40:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
