Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

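Abstract

Self-supervised sound source localization is usually made difficult by the inconsistency between modalities.
Recent studies have shown that contrastive-learning-based strategies are promising for establishing a consistent correspondence between audio and sound sources in visual scenes.
Unfortunately, insufficient attention to the influence of heterogeneity across the features of different modalities still limits further improvement of this scheme, which motivates our work.
In this study, an Induction Network is proposed to bridge the modality gap more effectively.
By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality.
In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network.
Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance over other state-of-the-art works in a variety of challenging scenarios.
The code is available at https://github.com/Tahy1/AVIN.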
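
To make the idea of an audio-visual correspondence concrete, here is a minimal sketch of how a localization map is commonly formed in this family of methods: the cosine similarity between one audio embedding and the visual feature at each spatial location. This is a generic illustration, not the Induction Network itself. The function names (`localization_map`, `peak_location`), the feature shapes, and the random toy data are all assumptions made for this example; the abstract does not specify the Induction Vector, the loss, or the threshold selection, so none of those are reproduced here.

```python
import numpy as np

def localization_map(visual_feats, audio_vec, eps=1e-8):
    """Cosine similarity between one audio embedding and every spatial location.

    visual_feats: array of shape (H, W, D), one D-dim feature per location
    audio_vec:    array of shape (D,)
    returns:      array of shape (H, W) with values in [-1, 1]
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_vec / (np.linalg.norm(audio_vec) + eps)
    return v @ a

def peak_location(sim_map):
    """Return (row, col) of the highest-similarity location."""
    return np.unravel_index(np.argmax(sim_map), sim_map.shape)

# Toy data (hypothetical): random features with one location aligned to the audio.
rng = np.random.default_rng(0)
H, W, D = 7, 7, 16
visual = rng.normal(size=(H, W, D))
audio = rng.normal(size=D)
visual[2, 5] = 3.0 * audio  # plant a "sounding" location

sim = localization_map(visual, audio)
row, col = peak_location(sim)
print(sim.shape, (int(row), int(col)))
```

In a trained model the two embeddings would come from learned visual and audio encoders, and the map would typically be thresholded to obtain a region rather than a single peak.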

Abstract (original)

Self-supervised sound source localization is usually challenged by the modality inconsistency. In recent studies, contrastive learning based strategies have shown promising to establish such a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, the insufficient attention to the heterogeneity influence in the different modality features still limits this scheme to be further improved, which also becomes the motivation of our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of visual and audio modalities, the discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Substantial experiments conducted on SoundNet-Flickr and VGG-Sound Source datasets have demonstrated a superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN

arXiv information

Authors	Tianyu Liu, Peng Zhang, Wei Huang, Yufei Zha, Tao You, Yanning Zhang
Published	2023-08-09 07:55:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
