Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives

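Summary

Audio-visual learning gives models a richer understanding of the real world by drawing on multiple sensory modalities, but this integration also introduces new vulnerabilities to adversarial attacks.
This paper presents a comprehensive study of the adversarial robustness of audio-visual models that considers both temporal and modality-specific vulnerabilities.
It proposes two strong adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments, and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities.
These attacks are designed to thoroughly evaluate the robustness of audio-visual models against diverse threats.
To defend against such attacks, the paper also introduces a new audio-visual adversarial training framework.
The framework addresses key challenges in vanilla adversarial training by incorporating efficient adversarial perturbation crafting tailored to multi-modal data, together with an adversarial curriculum strategy.
Extensive experiments on the Kinetics-Sounds dataset show that the proposed temporal and modality-based attacks achieve state-of-the-art performance in degrading model performance, while the adversarial training defense substantially improves both adversarial robustness and adversarial training efficiency.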

Abstract (original)

While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key challenges in vanilla adversarial training by incorporating efficient adversarial perturbation crafting tailored to multi-modal data and an adversarial curriculum strategy. Extensive experiments in the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks in degrading model performance can achieve state-of-the-art performance, while our adversarial training defense largely improves the adversarial robustness as well as the adversarial training efficiency.
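
For readers unfamiliar with adversarial perturbations on multi-modal inputs, the following is a minimal sketch of a generic one-step signed-gradient (FGSM-style) perturbation applied jointly to an audio and a visual input. It is not the paper's temporal invariance attack, modality misalignment attack, or training framework. The toy late-fusion linear classifier, the function names (`fused_logit`, `joint_fgsm`), and the per-modality budgets `eps_audio` and `eps_video` are all assumptions made for illustration.

```python
import math


def sign(x):
    """Return -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def fused_logit(w_audio, w_video, bias, audio, video):
    """Toy late-fusion model (assumed): sum of two linear heads plus a bias."""
    audio_score = sum(w * x for w, x in zip(w_audio, audio))
    video_score = sum(w * x for w, x in zip(w_video, video))
    return audio_score + video_score + bias


def joint_fgsm(w_audio, w_video, bias, audio, video, label, eps_audio, eps_video):
    """One signed-gradient step on both modalities.

    label is +1 or -1. The loss is the logistic loss log(1 + exp(-label * z)),
    whose derivative with respect to the logit z is -label / (1 + exp(label * z)).
    Each modality is perturbed within its own L-infinity budget.
    """
    z = fused_logit(w_audio, w_video, bias, audio, video)
    dloss_dz = -label / (1.0 + math.exp(label * z))
    # For a linear model, d loss / d input_i = dloss_dz * weight_i.
    adv_audio = [x + eps_audio * sign(dloss_dz * w) for x, w in zip(audio, w_audio)]
    adv_video = [x + eps_video * sign(dloss_dz * w) for x, w in zip(video, w_video)]
    return adv_audio, adv_video


# Hypothetical toy values, chosen only to exercise the functions.
w_audio, w_video, bias = [1.0, -2.0], [0.5, 0.5, -1.0], 0.0
audio, video, label = [0.2, -0.1], [0.4, 0.2, -0.3], 1
adv_audio, adv_video = joint_fgsm(
    w_audio, w_video, bias, audio, video, label, eps_audio=0.1, eps_video=0.05
)
clean_logit = fused_logit(w_audio, w_video, bias, audio, video)
adv_logit = fused_logit(w_audio, w_video, bias, adv_audio, adv_video)
```

In this linear toy, the step lowers the correct-class logit by `eps_audio` times the L1 norm of the audio weights plus `eps_video` times the L1 norm of the video weights. Real audio-visual models are nonlinear, so practical attacks compute gradients by backpropagation and usually iterate.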

arXiv information

Authors	Zeliang Zhang,Susan Liang,Daiki Shimada,Chenliang Xu
Published	2025-02-17 14:50:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
